Generating random numbers from copulae: the COPPY module
Alexis Boulin, aboulin@unice.fr, Laboratoire J.A. Dieudonné, UMR CNRS 7351, Université Côte d’Azur, Nice, France
Last build: March 2022
Abstract
The modeling of dependence between random variables is an important subject in several applied fields of science. To this aim, the copula function can be used as a margin-free description of the dependence structure. Several copulae belong to specific families such as the Archimedean, elliptical or extreme value families. While software implementation of copulae has been thoroughly explored in \textsf{R}, methods to work with copulae in \textsf{Python} are still in their infancy. To promote dependence modeling with copulae in \textsf{Python}, we have developed COPPY, a library that provides random vector generation for a range of copulae.
Keywords
Random number generation, copulas
1 Introduction
Modeling dependence relations between random variables is a topic of interest in probability theory and statistics. The most popular approach is based on the second moment of the underlying random variables, namely the covariance. It is well known that the covariance captures only linear dependence and characterizes the dependence structure only for a few models, such as the multivariate normal distribution. A beneficial alternative is the concept of copulae, going back to Sklar (1959). The copula C : [0,1]^d \rightarrow [0,1] of a random vector \textbf{X} = (X_0, \dots, X_{d-1}) with d \geq 2 allows us to separate the effect of dependence from the effect of the marginal distributions through the identity
F(x_0, \dots, x_{d-1}) = C\left(F_0(x_0), \dots, F_{d-1}(x_{d-1})\right),
where (x_0, \dots, x_{d-1}) \in \mathbb{R}^d, F denotes the joint distribution function of \textbf{X} and F_0, \dots, F_{d-1} its margins. The main consequence of this identity is that the copula completely characterizes the stochastic dependence between the margins of \textbf{X}.
In other words, copulae allow one to model marginal distributions and the dependence structure separately. Furthermore, motivated by Sklar’s theorem, the problem of investigating stochastic dependence reduces to the problem of investigating multivariate distribution functions on the unit hypercube [0,1]^d with uniform margins. The theory of copulae has been of prime interest for many applied fields of science, such as quantitative finance, actuarial science or environmental sciences. This increasing number of applications has led to a demand for statistical methods. To name some, semiparametric estimation was studied by C. Genest, Ghoudi, and Rivest (1995), nonparametric estimation of copulae was also investigated, while Gijbels, Omelka, and Veraverbeke (2015) and Portier and Segers (2018) analyse the weak convergence of nonparametric estimators of conditional copulae. Such results are established for an arbitrary fixed dimension d \geq 2, but several investigations (see Einmahl and Lin (2006), Einmahl and Segers (2021)) are carried out for functional data for the tail copula, which captures the dependence in the upper tail. The empirical version of this function is shown to converge weakly to a Gaussian process which is similar in structure to the weak limit of the empirical copula process.
While software implementation of copulae has been thoroughly explored in \textsf{R} (see, e.g., A. G. Stephenson (2002), Jun Yan (2007), Schepsmeier et al. (2019)), methods to work with copulae in \textsf{Python} are still in their infancy. As far as we know, modules devoted to copulae in \textsf{Python} are mainly designed for modeling (see Alvarez et al. (2021), Bock and Chapman (2021)). These modules use maximum likelihood methods to estimate the copula parametrically from observed data and propose to sample synthetic data through the estimated copula model. Nevertheless, they do not allow generating random numbers from a given copula model where the user just has to specify the desired parameter. Furthermore, only the Archimedean and elliptical families are covered, and the multivariate case has barely been touched. Here, we propose to implement a wide range of copulae, including the extreme value class and, when possible, in arbitrary fixed dimension d \geq 2.
Throughout this article, we adopt the following notational conventions: all indices start at 0, as in \textsf{Python}. Consider a probability space (\Omega, \mathcal{A}, \mathbb{P}) and let \textbf{X} = (X_0, \dots, X_{d-1}) be a d-dimensional random vector with values in (\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d)), with d \geq 2. This random vector has a joint distribution F with copula C, and its margins are denoted by F_j(x) = \mathbb{P}\{X_j \leq x\} for all x \in \mathbb{R} and j \in \{0, \dots, d-1\}. Denote by \textbf{U} = (U_0, \dots, U_{d-1}) a d-dimensional random vector with copula C and uniform margins. All bold letters \textbf{x} will denote vectors.
The module COPPY is implemented using the object-oriented features of the \textsf{Python} language. The classes are designed for Archimedean, elliptical and extreme value copulae. Section 2 briefly presents the classes defined in the module. Section 3 describes methods to generate random vectors. Section 4 presents an application of COPPY to the modeling of pairwise dependence between maxima. Section 5 discusses some further improvements of the module and concludes. Section 6.1 - Section 6.5 are devoted to defining and illustrating all the parametric copula models implemented in the module.
2 Classes
Figure 1: Object diagram that structures the code. The Multivariate class is the deepest root and serves to instantiate all its children Archimedean, Extreme, Gaussian, Student, in red. Grandchildren correspond to some parametric copula models and are depicted in blue. Examples of methods are the great-grandchildren, colored in green.
Three main classes are defined in the COPPY module: the Multivariate, Archimedean and Extreme classes (see Figure 1 for a description of the architecture of the code). The Multivariate class is designed to define multivariate copulae (including the bivariate case) and is located at the highest level of the code architecture. As a generic class, it contains methods to instantiate the copula object or to sample from a copula with the desired margins, using inversion methods for example. At a second level, we find its children, namely Archimedean, Extreme, Gaussian and Student. These belong to different families of copulae: the Archimedean and extreme value families for Archimedean and Extreme, and the elliptical family for Gaussian and Student. This partition is relevant theoretically, as Archimedean and extreme value copulae are studied in their own right (see Charpentier and Segers (2009) or Christian Genest and Segers (2009) to name a few), but also in practice, as they admit efficient sampling methods. However, the Gaussian and Student classes are kept separate because the most efficient sampling algorithms are specific to each and cannot be generalized to a broader elliptical class. At the third level of the architecture, we find some important Archimedean and extreme value parametric copula models (depicted in blue in Figure 1). These parametric models contain methods such as the generator function \varphi (see Section 2.1) for the Archimedean family and the Pickands dependence function A (see Section 2.2) for the extreme value family (in green in Figure 1). We recall in Section 2.1 the definition of Archimedean copulae and some of their properties in high-dimensional spaces. A characterization of extreme value copulae is given in Section 2.2.
For each section, the reader is referred to the appendix to discover the wide range of copulae implemented in the module: Section 6.1 and Section 6.3 for Archimedean copulae, Section 6.2 and Section 6.4 for extreme value copulae and lastly Section 6.5 for elliptical copulae.
2.1 The Archimedean class
Let \varphi be a generator, that is, a strictly decreasing, convex function from [0,1] to [0, \infty] such that \varphi(1) = 0 and \varphi(0) = \infty. We denote by \varphi^\leftarrow the generalized inverse of \varphi and define
C(u_0, \dots, u_{d-1}) = \varphi^\leftarrow\left( \varphi(u_0) + \dots + \varphi(u_{d-1}) \right), \quad \textbf{u} \in [0,1]^d. \qquad(1)
If this relation holds and C is a copula function, then C is called an Archimedean copula. A characterization for Equation 1 to define a copula is that \varphi^\leftarrow is d-monotone, that is, \varphi^\leftarrow is differentiable up to the order d-2 and its derivatives satisfy
(-1)^k \frac{d^k}{d x^k} \varphi^\leftarrow(x) \geq 0, \quad k \in \{0, \dots, d-2\}, \qquad(2)
for x \in (0, \infty), with (-1)^{d-2} \frac{d^{d-2}}{d x^{d-2}} \varphi^\leftarrow nonincreasing and convex (see Corollary 2.1 of McNeil and Nešlehová (2009)). It is particularly interesting to emphasize that d-monotone Archimedean inverse generators do not necessarily generate Archimedean copulae in dimensions higher than d. To that extent, some Archimedean subclasses are only implemented for the bivariate case, as in greater dimensions they do not generate an Archimedean copula. In the bivariate case, it is worth noticing that Equation 2 reduces to the convexity of \varphi^\leftarrow.
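As a concrete illustration of the construction in Equation 1, the following self-contained sketch evaluates a Clayton copula from its generator \varphi(t) = (t^{-\theta} - 1)/\theta. The function names are ours for illustration, not COPPY's API.

```python
# Archimedean construction C(u) = phi^{<-}(phi(u_0) + ... + phi(u_{d-1}))
# for the Clayton generator phi(t) = (t^(-theta) - 1)/theta, theta > 0.
theta = 2.0

def phi(t):
    return (t ** (-theta) - 1.0) / theta           # generator

def phi_inv(x):
    return (1.0 + theta * x) ** (-1.0 / theta)     # generalized inverse

def clayton(u):
    """C(u) = phi^{<-}(phi(u_0) + ... + phi(u_{d-1})), any dimension d."""
    return phi_inv(sum(phi(uj) for uj in u))

# Sanity checks: uniform margins are recovered, and C respects the
# Frechet-Hoeffding upper bound min(u_0, ..., u_{d-1}).
print(clayton([0.5, 1.0]))           # = 0.5, since C(u, 1) = u
print(clayton([0.3, 0.7]) <= 0.3)    # True
```

For \theta > 0 the Clayton family is positively associated, so the value C(u_0, u_1) also dominates the independence copula u_0 u_1.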
The Archimedean copula classes implemented in the module are commonly used one-parameter families, such as the Clayton (Clayton (1978)), Gumbel (Gumbel (1960)), Joe (Joe (1997)), Frank (Frank (1979)) and AMH (Ali, Mikhail, and Haq (1978)) copulae for the multivariate case. Remark that Archimedean copulae are always symmetric and that in dimension 3 or higher only positive association is allowed. For the specific bivariate case, other families, such as numbers 4.2.9 to 4.2.15 and 4.2.22 of Section 4.2 of Nelsen (2007), are implemented. We refer the reader to Section 6.1 and Section 6.3 for definitions and illustrations of these parametric copula models.
2.2 The Extreme class
Investigating the notion of copulae within the framework of multivariate extreme value theory leads to the so-called extreme value copulae (see Gudendorf and Segers (2010) for an overview) defined as
C(\textbf{u}) = \exp \left( - \ell(-\ln(u_0), \dots, -\ln(u_{d-1})) \right), \quad \textbf{u} \in (0,1]^d, \qquad(3)
with \ell : [0,\infty)^d \rightarrow [0,\infty) the stable tail dependence function, which is convex, homogeneous of order one, namely \ell(c\textbf{x}) = c\,\ell(\textbf{x}) for c > 0, and satisfies \max(x_0,\dots,x_{d-1}) \leq \ell(x_0,\dots,x_{d-1}) \leq x_0+\dots+x_{d-1}, \forall \textbf{x} \in [0,\infty)^d. Denote by \Delta^{d-1} = \{\textbf{w} \in [0,1]^d : w_0 + \dots + w_{d-1} = 1\} the unit simplex. By homogeneity, \ell is characterized by the Pickands dependence function A : \Delta^{d-1} \rightarrow [1/d,1], which is the restriction of \ell to the unit simplex \Delta^{d-1} :
\ell(x_0, \dots,x_{d-1}) = (x_0 + \dots + x_{d-1})\, A(w_0, \dots, w_{d-1}), \quad w_j = \frac{x_j}{x_0 + \dots + x_{d-1}}, \qquad(4)
for j \in \{1,\dots,d-1\} and w_0 = 1 - w_1 - \dots - w_{d-1}, with \textbf{x} \in [0, \infty)^d \setminus \{\textbf{0}\}.
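Equation 3 and Equation 4 can be illustrated with the bivariate logistic (Gumbel) model, whose Pickands dependence function is A(t) = (t^\theta + (1-t)^\theta)^{1/\theta} with \theta \geq 1. The sketch below (our own code, not COPPY's API) checks the bounds on A and the homogeneity of \ell.

```python
# Bivariate logistic (Gumbel) model: A(t) = (t^theta + (1-t)^theta)^(1/theta).
theta = 3.0

def A(t):
    return (t ** theta + (1.0 - t) ** theta) ** (1.0 / theta)

def ell(x0, x1):
    """Stable tail dependence function l(x0, x1) = (x0 + x1) A(x1/(x0+x1))."""
    s = x0 + x1
    return s * A(x1 / s)

# A is bounded: max(t, 1-t) <= A(t) <= 1 on the unit simplex.
for t in [0.1, 0.25, 0.5, 0.9]:
    assert max(t, 1.0 - t) <= A(t) <= 1.0

# l is homogeneous of order one: l(c x) = c l(x).
print(abs(ell(0.8, 2.2) - 2.0 * ell(0.4, 1.1)) < 1e-12)  # True
```

The boundary values A(0) = A(1) = 1 correspond to independence in each direction, while A \equiv 1/2 would be perfect dependence in the bivariate case.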
From a practical point of view, the family of extreme value copulae is very rich and arises naturally as the class of limiting distributions of properly normalised component-wise maxima. Furthermore, it contains a rich variety of parametric models and allows for asymmetric dependence. For the multivariate framework, the logistic copula (or Gumbel copula, see Gumbel (1960)) and the asymmetric logistic copula (Tawn (1990)) are implemented. We emphasize here that the logistic copula is the sole model that is both Archimedean and extreme value. Bivariate extreme value copulae included in the library are the asymmetric logistic (Tawn (1988)), asymmetric negative logistic (Joe (1990)), asymmetric mixed (Tawn (1988)), Hüsler-Reiss (Hüsler and Reiss (1989)), t-EV (Demarta and McNeil (2005)), bilogistic (Smith (1990)) and Dirichlet (Boldi and Davison (2007)) models. The reader is again referred to Section 6.2 and Section 6.4 for precise definitions of these models.
3 Random number generator
We propose a \textsf{Python}-based implementation for generating random numbers from a wide variety of copulae. COPPY requires some external packages in order to work. They are very few and freely available online.
NumPy version 1.6.1 or newer. This is the fundamental package for scientific computing, it contains linear algebra functions and matrix / vector objects (Van Der Walt, Colbert, and Varoquaux (2011)).
SciPy version 1.7.1 or newer. A library of open-source software for mathematics, science and engineering (Jones, Oliphant, and Peterson (2001)).
The random vector generator methods in the code are sample_unimargin and sample for the Multivariate class. The first one generates a sample whose margins are uniformly distributed on the unit segment [0,1], while the second generates a sample from the chosen margins. In Section 3.1, we present an algorithm used to sample from a copula using the conditioning method. This method is very general and may be used for every copula that is sufficiently smooth (see Equation 6, Equation 7 and Equation 8 below). However, the practical infeasibility of the algorithm in dimension higher than 2 and the computationally intensive numerical inversion call for efficient ways to sample in greater dimensions. The purpose of Section 3.2 is to present those methods and detail those used in the module. In each section, typical lines of code are presented to instantiate a copula and to sample with COPPY.
In the sequel, all \textsf{Python} code demonstrations assume that the modules have been loaded:
from coppy.rng import base, evd, archimedean
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
3.1 The bivariate case
In this subsection, we address the problem of generating a bivariate sample from a specified joint distribution with d = 2. Suppose that we want to sample a bivariate random vector \textbf{X} with copula C. If the components are independent, the procedure is straightforward, as we can sample X_0 and X_1 separately. In the copula framework, independence is only a special case.
One procedure to generate a pair (u_0,u_1) from \textbf{U} is the conditional distribution method. This method is convenient if the law of U_1 given U_0 can be easily sampled. An interesting property of copulae is the link between the conditional distribution and the derivative of the copula function (see Nelsen (2007), page 41), namely
c_{u_0}(u_1) \triangleq \mathbb{P}\left\{ U_1 \leq u_1 \,|\, U_0 = u_0 \right\} = \frac{\partial C(u_0, u_1)}{\partial u_0}.
Thus, a convenient algorithm to simulate a bivariate copula, as long as the copula admits a first partial derivative with respect to its first component, is given by Algorithm 1.
For the sixth step of Algorithm 1, it is equivalent to find u_1 \in [0,1] such that c_{u_0}(u_1) - t_1 = 0 holds. Such a u_1 always exists because, for every u \in ]0,1[, 0 \leq c_{u_0}(u) \leq 1 and u \mapsto c_{u_0}(u) is an increasing function (see Theorem 2.2.7 of Nelsen (2007) for a proof). This step is solved using a root-finding routine from the SciPy module. A sufficient condition for C to admit a first partial derivative in the Archimedean and extreme value cases is that the generator \varphi and the Pickands dependence function A are continuously differentiable on ]0,1[. The first partial derivatives of C are then given respectively by
\frac{\partial C}{\partial u_0}(u_0, u_1) = \frac{\varphi'(u_0)}{\varphi'\left(C(u_0,u_1)\right)}, \qquad \frac{\partial C}{\partial u_0}(u_0, u_1) = \frac{C(u_0,u_1)}{u_0}\, \mu(t), \qquad(5)
where t = \ln(u_1) / \ln(u_0u_1) \in (0,1) and \mu(t) = A(t) - tA'(t).
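Algorithm 1 can be sketched in a few lines for the bivariate Clayton copula, using the Archimedean partial derivative of Equation 5 and plain bisection in place of the SciPy root finder. This is an illustration written for this purpose, not COPPY's cond_sim implementation.

```python
import random

# Conditional distribution method for the bivariate Clayton copula, theta = 2.
theta = 2.0

def phi(t):
    return (t ** (-theta) - 1.0) / theta          # Clayton generator

def phi_prime(t):
    return -t ** (-theta - 1.0)                   # phi'(t)

def phi_inv(x):
    return (1.0 + theta * x) ** (-1.0 / theta)    # generalized inverse

def C(u0, u1):
    return phi_inv(phi(u0) + phi(u1))

def cond_cdf(u0, u1):
    """c_{u0}(u1) = P(U1 <= u1 | U0 = u0) = phi'(u0) / phi'(C(u0, u1))."""
    return phi_prime(u0) / phi_prime(C(u0, u1))

def sample_pair(rng=random):
    u0 = max(rng.random(), 1e-12)   # U0 is directly uniform
    t1 = rng.random()               # auxiliary uniform variate
    lo, hi = 1e-12, 1.0 - 1e-12     # solve c_{u0}(u1) = t1 by bisection
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if cond_cdf(u0, mid) < t1:
            lo = mid
        else:
            hi = mid
    return u0, 0.5 * (lo + hi)

random.seed(1)
pairs = [sample_pair() for _ in range(2000)]
print(sum(u1 for _, u1 in pairs) / len(pairs))  # close to 1/2 (uniform margin)
```

The bisection exploits exactly the two properties invoked above: c_{u_0} is increasing and takes values in [0,1], so the root is bracketed by ]0,1[.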
We now have all the necessary theoretical tools to give details on how the COPPY module is designed. The file base.py contains the Multivariate class, where the sample method samples from \textbf{X} with copula C. To do so, we use the inversion method, that is, we sample from \textbf{U} using Algorithm 1 and compose the corresponding uniform margins with F_j^\leftarrow. Equation 5 indicates that the sole knowledge of A and \varphi and their respective derivatives is needed in order to perform the sixth step of Algorithm 1. For that purpose, Algorithm 1 is named cond_sim in the code and is located inside the Archimedean and Extreme classes. Each child of the bivariate Archimedean (resp. Extreme) class is thus defined by its generator \varphi (resp. A), its derivative \varphi' (resp. A') and its inverse \varphi^\leftarrow, as emphasized in green in Figure 1. Namely, we perform Algorithm 1 for the Archimedean subclasses Frank, AMH, Clayton (when \theta < 0 for the previous three), Nelsen_9, Nelsen_10, Nelsen_11, Nelsen_12, Nelsen_13, Nelsen_14, Nelsen_15 and Nelsen_22. For the Extreme class, this algorithm is performed for the Asy_neg_log and Asy_mixed models. For the other models, faster algorithms are known and thus implemented; we refer to Section 3.2 for details.
The following code illustrates random vector generation for a bivariate Archimedean copula. Given the parameter of the copula and the sample length, the constructor of such a copula is available and may be called via the Clayton() class as follows:
We now address the simulation of the multivariate Archimedean and extreme value copulae proposed in the COPPY module. In the multivariate case, the link between partial derivatives and conditioning remains. Indeed, let (U_0, \dots, U_{d-1}) be a d-dimensional random vector with uniform margins and copula C; the conditional distribution of U_k given the values of U_0, \dots, U_{k-1} is
\mathbb{P}\left\{ U_k \leq u_k \,|\, U_0 = u_0, \dots, U_{k-1} = u_{k-1} \right\} = \frac{\partial^{k} C_{k+1}(u_0, \dots, u_k) / \partial u_0 \cdots \partial u_{k-1}}{\partial^{k} C_{k}(u_0, \dots, u_{k-1}) / \partial u_0 \cdots \partial u_{k-1}}, \qquad(8)
for k \in \{1,\dots, d-1\}, where C_k denotes the copula of (U_0, \dots, U_{k-1}), that is, C_k(u_0, \dots, u_{k-1}) = C(u_0, \dots, u_{k-1}, 1, \dots, 1). The conditional simulation algorithm may be written as follows.
Generate d independent random variates v_0, \dots, v_{d-1}, uniformly distributed on [0,1].
Set u_0 = v_0.
For k = 1, \dots, d-1, evaluate the inverse of the conditional distribution given by Equation 8 at v_k, to generate u_k.
Nevertheless, evaluation of the inverse conditional distribution becomes increasingly complicated as the dimension d increases. Furthermore, it can be difficult for some models to derive a closed form of Equation 8, which makes it impossible to implement in a general algorithm where only the dimension d is an input. For multivariate Archimedean copulae, McNeil and Nešlehová (2009) give a method to generate a random vector from the d-dimensional copula C with generator \varphi (see Section 5.2 of loc. cit.). A stochastic representation for Archimedean copulae generated by a d-monotone generator is given by
\textbf{U} = \left( \varphi^\leftarrow(R S_0), \dots, \varphi^\leftarrow(R S_{d-1}) \right),
where R \sim F_R, the radial distribution, is independent of \textbf{S} = (S_0, \dots, S_{d-1}), which is distributed uniformly on the unit simplex \Delta^{d-1}. One challenging aspect of this algorithm is to have an accurate evaluation of the radial distribution of the Archimedean copula and thus to numerically invert this distribution. The radial distribution associated with the \textsf{Clayton} copula is given in Example 3.3 of the paper cited above, while those of the \textsf{Joe}, \textsf{AMH}, \textsf{Gumbel} and \textsf{Frank} copulae are given in Hofert, Mächler, and McNeil (2012). In general, one can use numerical inversion algorithms for computing the inverse of the radial distribution; however, this leads to spurious numerical errors. Other algorithms exist when the inverse generator is known to be the Laplace-Stieltjes transform, denoted \mathcal{LS}, of some positive random variable (see Marshall and Olkin (1988), Frees and Valdez (1998)). This positive random variable is often referred to as the frailty distribution. In this framework, Archimedean copulae allow for the stochastic representation
\textbf{U} = \left( \varphi^\leftarrow(E_0 / V), \dots, \varphi^\leftarrow(E_{d-1} / V) \right),
with V \sim F = \mathcal{LS}^{-1}[\varphi^\leftarrow] the frailty and E_0, \dots, E_{d-1} i.i.d. standard exponential random variables independent of V. The algorithm to sample from this representation is thus:
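For instance, the frailty algorithm can be sketched for a d-dimensional Clayton copula with the unscaled generator \varphi(t) = t^{-\theta} - 1, whose inverse \varphi^\leftarrow(t) = (1+t)^{-1/\theta} is the Laplace-Stieltjes transform of a Gamma(1/\theta, 1) frailty. This is an illustration written here, not COPPY's \texttt{\_frailty\_sim} method.

```python
import random

# Frailty (Marshall & Olkin) sampler for a d-dimensional Clayton copula:
# V ~ Gamma(1/theta, 1), E_j i.i.d. Exp(1), U_j = phi^{<-}(E_j / V).
theta, d = 2.0, 3

def phi_inv(t):
    return (1.0 + t) ** (-1.0 / theta)   # Laplace-Stieltjes transform of V

def frailty_sample(rng=random):
    v = rng.gammavariate(1.0 / theta, 1.0)         # step 1: frailty V
    e = [rng.expovariate(1.0) for _ in range(d)]   # step 2: i.i.d. Exp(1)
    return [phi_inv(ej / v) for ej in e]           # step 3: U_j = phi^{<-}(E_j/V)

random.seed(0)
sample = [frailty_sample() for _ in range(5000)]
print(sum(u[0] for u in sample) / len(sample))  # margin mean, close to 1/2
```

The margins are indeed uniform since \mathbb{P}(U_j \leq u) = \mathbb{E}[e^{-V \varphi(u)}] = \varphi^\leftarrow(\varphi(u)) = u, which is the key property of the frailty representation.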
In this framework, we define the \texttt{\_frailty\_sim} method inside the \textbf{Archimedean} class, which performs Algorithm 2. Each Archimedean copula whose frailty distribution is known is thus defined by the generator \varphi, its inverse \varphi^\leftarrow and the frailty distribution \mathcal{LS}^{-1}[\varphi^\leftarrow]. This is the case for \texttt{Joe}, \texttt{Clayton}, \texttt{AMH} and \texttt{Frank}.
For the extreme value case, algorithms have been proposed, as in A. Stephenson (2003) (see Algorithms 2.1 and 2.2), who proposes sampling methods for the Gumbel and the asymmetric logistic models. These algorithms are implemented in the COPPY module. Note that these algorithms are model-specific; thus the \texttt{sample\_unimargin} method is exceptionally located in the corresponding child of the multivariate Extreme class. Another procedure, designed by Dombry, Engelke, and Oesting (2016) to sample from multivariate extreme value models using extremal functions (see Algorithm 2 of the reference cited above), is also of prime interest. For the models implemented using this algorithm, namely the Hüsler-Reiss, t-EV, bilogistic and Dirichlet models, a method called \texttt{rExtFunc} is located inside each class, which generates an observation from the corresponding law of the extremal function.
Samples from the Gaussian and Student copulae are obtained directly via Algorithms 5.9 and 5.10, respectively, of Alexander J. McNeil (2005). As each algorithm is model-specific, the \texttt{sample\_unimargin} method is located inside the Gaussian and Student classes.
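The Gaussian case can be sketched in the spirit of Algorithm 5.9 of the reference above: draw a correlated normal pair through the Cholesky factor of the correlation matrix, then apply the standard normal cdf to each coordinate. The code below is our own bivariate illustration, not the module's \texttt{sample\_unimargin}.

```python
import math
import random

# Bivariate Gaussian copula sampler with correlation rho.
rho = 0.7

def norm_cdf(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_copula_pair(rng=random):
    z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    y0 = z0                                          # Cholesky factor of
    y1 = rho * z0 + math.sqrt(1.0 - rho ** 2) * z1   # [[1, rho], [rho, 1]]
    return norm_cdf(y0), norm_cdf(y1)                # probability transform

random.seed(42)
pairs = [gaussian_copula_pair() for _ in range(5000)]
print(sum(u0 for u0, _ in pairs) / len(pairs))  # margin mean, close to 1/2
```

In higher dimensions the same scheme applies, with the 2x2 Cholesky factor replaced by a full Cholesky decomposition of the correlation matrix.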
We now present how to construct a multivariate Archimedean copula and generate random vectors from this model. Given the parameters of the copula, we call the following lines to construct our copula object:
4 Case study : Modeling pairwise dependence between spatial maxima with missing data
We now proceed to a case study where we use our module to assess, in a finite sample framework, the asymptotic properties of an estimator of the \lambda-madogram when data are missing completely at random (MCAR). This case study comes from the numerical results of Boulin et al. (2021). The \lambda-madogram belongs to a family of estimators, namely the madogram, which is of prime interest in environmental sciences, as it is designed to model pairwise dependence between maxima in space; see Bernard et al. (2013), Bador et al. (2015) and Saunders, Stephenson, and Karoly (2021), where the madogram was used as a dissimilarity measure to perform clustering. While in several fields, for example econometrics (Wooldridge (2007)) or survey theory (Boistard, Chauvet, and Haziza (2016)), the MCAR hypothesis appears to be a strong one, it is more realistic in environmental research, as the missingness of one observation is usually due to instrument, communication and processing errors that may reasonably be supposed independent of the quantity of interest. In Section 4.1, we define the objects and properties of interest, while in Section 4.2 we give a detailed tutorial in \textsf{Python} with the COPPY module to compare the asymptotic variance with an empirical counterpart for the \lambda-madogram with \lambda = 0.5.
4.1 Background
It was emphasized that the possible dependence between maxima can be described with an extreme value copula. This function is completely characterized by the Pickands dependence function (see Equation 4), and the latter is equivalent to the \lambda-madogram introduced by Naveau et al. (2009) and defined as
\nu(\lambda) = \frac{1}{2} \mathbb{E}\left[ \left| F_0(X_0)^{1/\lambda} - F_1(X_1)^{1/(1-\lambda)} \right| \right],
with \lambda \in (0,1); if \lambda = 0 and 0<u<1, then u^{1/\lambda} = 0 by convention. This quantity takes its inspiration from the extensively used geostatistical tool, namely the variogram (see Chapter 1.3 of Gaetan and Guyon (2008) for a definition and some classical properties). The \lambda-madogram can be interpreted as the L_1-distance between the uniform margins raised to the inverse of the corresponding weights \lambda and 1-\lambda. This quantity describes the dependence structure between extremes through its relation with the Pickands dependence function: if we suppose that C is an extreme value copula as in Equation 3, we have
\nu(\lambda) = \frac{A(\lambda)}{1+A(\lambda)} - c(\lambda),
with c(\lambda) = 2^{-1} \left( \lambda / (1+\lambda) + (1-\lambda)/(2-\lambda) \right) (see Proposition 3 of Marcon et al. (2017) for details).
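The definition above admits a simple rank-based empirical counterpart, obtained by replacing the margins with their empirical versions. The sketch below only illustrates the definition; it is not the hybrid or corrected estimators studied in Boulin et al. (2021).

```python
import bisect
import random

# Rank-based empirical lambda-madogram:
# nu_hat(lambda) = (2n)^{-1} sum_i |F_n0(X_i0)^(1/lambda) - F_n1(X_i1)^(1/(1-lambda))|.
def ecdf_values(x):
    """Empirical cdf of the sample evaluated at each sample point."""
    sx = sorted(x)
    return [bisect.bisect_right(sx, xi) / len(x) for xi in x]

def lambda_madogram(x0, x1, lam):
    u0, u1 = ecdf_values(x0), ecdf_values(x1)
    return sum(abs(a ** (1.0 / lam) - b ** (1.0 / (1.0 - lam)))
               for a, b in zip(u0, u1)) / (2.0 * len(x0))

random.seed(7)
# The margins are irrelevant: the estimator only uses ranks.
x0 = [random.gauss(0.0, 1.0) for _ in range(4000)]
x1 = [random.expovariate(1.0) for _ in range(4000)]
print(lambda_madogram(x0, x1, 0.5))  # close to 1/6 for independent data
```

For \lambda = 0.5 and independent components, \nu(0.5) = \frac{1}{2}\mathbb{E}|U_0^2 - U_1^2| = 1/6, which is also recovered from the relation above with A \equiv 1; for comonotone data the madogram vanishes, consistently with A(0.5) = 1/2.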
We consider independent and identically distributed (i.i.d.) copies \textbf{X}_1, \dots, \textbf{X}_n of \textbf{X}. In the presence of missing data, we do not observe a complete vector \textbf{X}_i for i \in \{1,\dots,n\}. We introduce \textbf{I}_i \in \{0,1\}^2 which satisfies, \forall j \in \{0,1\}, I_{i,j} = 0 if X_{i,j} is not observed. To formalize incomplete observations, we introduce the incomplete vector \tilde{\textbf{X}}_i with values in the product space \bigotimes_{j=1}^2 (\mathbb{R} \cup \{\textsf{NA}\}), such that \tilde{X}_{i,j} = X_{i,j} if I_{i,j} = 1 and \tilde{X}_{i,j} = \textsf{NA} otherwise. The observed sample is thus
(\textbf{I}_i, \tilde{\textbf{X}}_i), \quad i \in \{1,\dots,n\}. \qquad(12)
For all i \in \{1, \dots,n \}, the \textbf{I}_{i} are copies of \textbf{I} = (I_0, I_1), where I_j follows a Bernoulli distribution \mathcal{B}(p_j) with p_j = \mathbb{P}(I_j = 1) for j \in \{0,1\}. We denote by p the probability of completely observing a realization of \textbf{X}, that is, p = \mathbb{P}(I_0=1, I_1 = 1). In Boulin et al. (2021), hybrid and corrected estimators, respectively denoted \hat{\nu}_n^{\mathcal{H}} and \hat{\nu}_n^{\mathcal{H}*}, are proposed to estimate the \lambda-madogram nonparametrically in the presence of data missing completely at random. Furthermore, a closed expression of their asymptotic variances for \lambda \in ]0,1[ is also given. This result is summarized in the following theorem.
Theorem 1 (Boulin et al. (2021)) Let (\textbf{I}_i, \tilde{\textbf{X}}_i)_{i=1}^n be a sample given by Equation 12. For \lambda \in ]0,1[, if C is an extreme value copula with Pickands dependence function A, we have, as n \rightarrow \infty,\begin{align*}
&\sqrt{n} \left(\hat{\nu}_n^{\mathcal{H}}(\lambda) - \nu( \lambda)\right) \overset{d}{\rightarrow} \mathcal{N}\left(0, \mathcal{S}^{\mathcal{H}}(p_0,p_1,p, \lambda)\right), \\
&\sqrt{n} \left(\hat{\nu}_n^{\mathcal{H}*}(\lambda) - \nu( \lambda)\right) \overset{d}{\rightarrow} \mathcal{N}\left(0, \mathcal{S}^{\mathcal{H}*}(p_0,p_1,p, \lambda)\right),
\end{align*}
where \mathcal{S}^{\mathcal{H}}(p_0,p_1,p, \lambda) and \mathcal{S}^{\mathcal{H}*}(p_0,p_1,p, \lambda) are the asymptotic variances of the respective estimators.
4.2 Numerical results
Benefiting from data generation with COPPY, we are able, via Monte Carlo simulation, to assess the theoretical results of Theorem 1 in a finite sample setting. For that purpose, we implement a \texttt{Monte\_Carlo} class (in the \texttt{monte\_carlo.py} file) which contains methods to perform Monte Carlo iterations for a given extreme value copula. Before going any further, we have to import the necessary libraries.
We thus set up parameters to simulate our bivariate dataset. For this subsection, we choose the asymmetric negative logistic model (see Section 6.2 for a definition) with parameters \theta = 10, \psi_1 = 0.1, \psi_2 = 1.0.
The 2048 \times 2 array contains 2048 realizations of the asymmetric negative logistic model, where the first column is distributed according to a standard normal random variable and the second according to a standard exponential. This distribution is depicted below. To obtain it, one needs the following lines of command:
Before going into further details, we present the missing mechanism. Let V_0 and V_1 be random variables uniformly distributed on the ]0,1[ segment with copula C_{(V_0,V_1)}. We set I_0 = 1_{\{V_0 \leq p_0\}} and I_1 = 1_{\{V_1 \leq p_1\}}. It is thus immediate that I_0 \sim \mathcal{B}(p_0), I_1 \sim \mathcal{B}(p_1) and p \triangleq \mathbb{P}\{I_0 = 1, I_1 =1 \} = C_{(V_0,V_1)}(p_0, p_1). For our illustration, we take C_{(V_0,V_1)} as a \texttt{Joe} copula with parameter \tau = 2.0 (we refer to Section 6.1 of the appendix for a representation of this copula). Thereby, if X_0 is not observed, it is more likely that X_1 is not observed either. A contrario, the observation of X_0 does not influence the observation of the other random variable X_1. Indeed, for this copula, it is more likely to observe a realization v_0 \geq 0.8 from V_0 if v_1 \geq 0.8 from V_1, while if we observe v_1 < 0.8, the realization v_0 is close to being independent of v_1. In climate studies, extreme events can damage the recording instruments in the surrounding regions where they occur, thus the missingness of one variable may depend on others. We initialize the copula C_{(V_0,V_1)} with the following line
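The mechanism I_j = 1_{\{V_j \leq p_j\}} can be sketched as follows, simplifying C_{(V_0,V_1)} to the independence copula so that p = C_{(V_0,V_1)}(p_0, p_1) = p_0 p_1 is known in closed form; the case study itself uses a Joe copula instead.

```python
import random

# MCAR missing mechanism with independent (V_0, V_1):
# p = P(I_0 = 1, I_1 = 1) = C(p_0, p_1) = p_0 * p_1 = 0.81 here.
p0 = p1 = 0.9

random.seed(3)
n = 20000
both = 0
for _ in range(n):
    v0, v1 = random.random(), random.random()
    i0, i1 = v0 <= p0, v1 <= p1      # missingness indicators I_0, I_1
    both += i0 and i1                # complete observation: I_0 = I_1 = 1
print(both / n)  # empirical p, close to 0.81
```

With a positively dependent copula such as the Joe model of the case study, the same p_0 and p_1 would yield a larger p, since missingness then tends to occur jointly.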
For a given \lambda \in ]0,1[, we now want to estimate the \lambda-madogram with a sample simulated from the asymmetric negative logistic model, where some observations are missing according to the mechanism described above. We replicate this step several times to compute an empirical counterpart of the asymptotic variance. The \texttt{Monte\_Carlo} object has been designed in this way: we specify the number of iterations n_{iter} (take n_{iter} = 1024), the chosen extreme value copula (the asymmetric negative logistic model), the missing mechanism (described by C_{(V_0,V_1)} and p_0 = p_1 = 0.9) and \lambda (noted \texttt{w}). We thus write the following lines:
u = np.array([0.9, 0.9])
n_iter, P, w = 1024, [[u[0], copula_miss._C(u)], [copula_miss._C(u), u[1]]], np.array([0.5, 0.5])
monte = monte_carlo.Monte_Carlo(n_iter = n_iter, n_sample = n_sample, copula = copula, copula_miss = copula_miss, w = w, P = P)
The \texttt{Monte\_Carlo} object is thus initialized with all the parameters needed. We may use the \texttt{simu} method to generate a \texttt{DataFrame} (a \texttt{Pandas} object) composed of 1024 rows and 3 columns. Each row contains an estimate of the \lambda-madogram, \hat{\nu}_n^{\mathcal{H}*} in Theorem 1 (\texttt{FMado}), the sample length n (\texttt{n}) and the normalized estimation error (\texttt{scaled}). We thus call the \texttt{simu} method.
Here, \texttt{corr = True} specifies that we compute the corrected estimator, \hat{\nu}_n^{\mathcal{H}*} in Theorem 1. Now, using the \texttt{var\_mado} method defined inside the Extreme class, we obtain the asymptotic variance for the given model and the parameters of the missing mechanism. We obtain this quantity as follows
var_mado = copula.var_mado(w, p = copula_miss._C(u), P = P, corr = True)
print(var_mado)
print(df_wmado['scaled'].var())
0.015417245591834503
0.017171515076703708
We now propose to check numerically the asymptotic normality, with variance \mathcal{S}^{\mathcal{H}*}, of the normalized estimation error of the corrected estimator. We have all the data in hand, and the asymptotic variance was computed by the lines above. We thus write:
sigma = np.sqrt(var_mado)
x = np.linspace(min(df_wmado['scaled']), max(df_wmado['scaled']), 1000)
gauss = gauss_function(x, 0, sigma)
sns.displot(data = df_wmado, x = "scaled", color = seagreen(0.5), kind = 'hist', stat = 'density', common_norm = False, alpha = 0.5, fill = True, linewidth = 1.5)
plt.plot(x, gauss, color = 'darkblue')
5 Discussion
This article presents the construction and some implementations of the \textsf{Python} module COPPY for random copula sampling. This is a seminal work of software implementation of copula modeling in \textsf{Python}, and there is much more to do. It is hoped that the diffusion of the software to everyone who needs it may bring other implementations for multivariate modeling with copulae in \textsf{Python}. For example, choosing a copula to fit the data is an important but difficult problem. A robust approach to estimating copulae has been investigated recently by Alquier et al. (2020) using Maximum Mean Discrepancy. In connection with our example, semiparametric estimation of copulae with missing data, as proposed by Hamori, Motegi, and Zhang (2019), could be of great interest.
Also, implementing the algorithm proposed by McNeil and Nešlehová (2009) to generate random vectors for Archimedean copulae has been tackled but, as expected, the numerical inversion gives spurious results, especially when the parameter \theta and the dimension d are large. Furthermore, as the support of the radial distribution is contained in the real line, the numerical inversion induces a greater computation time. Further investigations are thus needed in order to generate random vectors from classical Archimedean models using the radial distribution.
A direction of improvement for the COPPY module is dependence modeling with vine copulae, which have recently become a tool of high interest in the machine learning community; see, e.g., Lopez-Paz, Hernández-Lobato, and Zoubin (2013), Veeramachaneni, Cuesta-Infante, and O’Reilly (2015), Carrera, Santana, and Lozano (2016), Gonçalves, Von Zuben, and Banerjee (2016) or Sun, Cuesta-Infante, and Veeramachaneni (2019). This strengthens the need for dependence modeling with copulae in \textsf{Python}, as a non-negligible part of the machine learning community uses this language. In connection with this article, vine copulae might be interesting for modeling dependencies between extremes, as suggested by Simpson, Wadsworth, and Tawn (2021) and Nolde and Wadsworth (2021). Lastly, other copula models could be implemented to model further dependencies. These implementations will push forward the scope of dependence modeling with \textsf{Python} and give high-quality, usable tools for everyone in need.
References
McNeil, Alexander J., Paul Embrechts, and Rüdiger Frey. 2005. Quantitative Risk Management: Concepts, Techniques and Tools. Princeton Series in Finance. Princeton University Press.
Ali, Mir M., N. N. Mikhail, and M. Safiul Haq. 1978. “A Class of Bivariate Distributions Including the Bivariate Logistic.” Journal of Multivariate Analysis 8 (3): 405–12. https://doi.org/10.1016/0047-259X(78)90063-5.
Alquier, Pierre, Badr-Eddine Chérief-Abdellatif, Alexis Derumigny, and Jean-David Fermanian. 2020. “Estimation of Copulas via Maximum Mean Discrepancy.” https://arxiv.org/abs/2010.00408.
Alvarez, M., C. Sala, Y. Sun, J. D. Pérez, K. A. Zhang, A. Montanez, G. Bonomi, K. Veeramachaneni, I. Ramírez, and F. A. Hofman. 2021. “Copulas.” GitHub repository. https://github.com/sdv-dev/Copulas.
Bador, Margot, Philippe Naveau, Eric Gilleland, Mercè Castellà, and Tatiana Arivelo. 2015. “Spatial Clustering of Summer Temperature Maxima from the CNRM-CM5 Climate Model Ensembles & E-OBS over Europe.” Weather and Climate Extremes 9: 17–24. https://doi.org/10.1016/j.wace.2015.05.003.
Bernard, Elsa, Philippe Naveau, Mathieu Vrac, and Olivier Mestre. 2013. “Clustering of Maxima: Spatial Dependencies among Heavy Rainfall in France.” Journal of Climate 26 (20): 7929–37. https://doi.org/10.1175/JCLI-D-12-00836.1.
Boistard, Helene, Guillaume Chauvet, and David Haziza. 2016. “Doubly Robust Inference for the Distribution Function in the Presence of Missing Survey Data.” Scandinavian Journal of Statistics 43 (3): 683–99. https://doi.org/10.1111/sjos.12198.
Boldi, M.-O., and A. C. Davison. 2007. “A Mixture Model for Multivariate Extremes.” Journal of the Royal Statistical Society Series B 69 (2): 217–29. https://doi.org/10.1111/j.1467-9868.2007.
Boulin, Alexis, Elena Di Bernardino, Thomas Laloë, and Gwladys Toulemonde. 2021. “Non-Parametric Estimator of a Multivariate Madogram for Missing-Data and Extreme Value Framework.” arXiv preprint arXiv:2112.13575.
Carrera, Diana, Roberto Santana, and José Antonio Lozano. 2016. “Vine Copula Classifiers for the Mind Reading Problem.” Progress in Artificial Intelligence 5: 289–305.
Charpentier, Arthur, and Johan Segers. 2009. “Tails of Multivariate Archimedean Copulas.” Journal of Multivariate Analysis 100 (7): 1521–37. https://doi.org/10.1016/j.jmva.2008.12.015.
Clayton, D. G. 1978. “A Model for Association in Bivariate Life Tables and Its Application in Epidemiological Studies of Familial Tendency in Chronic Disease Incidence.” Biometrika 65 (1): 141–51. http://www.jstor.org/stable/2335289.
Demarta, Stefano, and Alexander J. McNeil. 2005. “The t Copula and Related Copulas.” International Statistical Review 73 (1): 111–29. https://doi.org/10.1111/j.1751-5823.2005.tb00254.x.
Dombry, Clément, Sebastian Engelke, and Marco Oesting. 2016. “Exact Simulation of Max-Stable Processes.” Biometrika 103 (2): 303–17. https://doi.org/10.1093/biomet/asw008.
Einmahl, John H. J., and Tao Lin. 2006. “Asymptotic Normality of Extreme Value Estimators on \mathcal{C}([0, 1]).” The Annals of Statistics 34 (1): 469–92. http://www.jstor.org/stable/25463423.
Einmahl, John H. J., and Johan Segers. 2021. “Empirical Tail Copulas for Functional Data.” The Annals of Statistics 49 (5): 2672–96. https://doi.org/10.1214/21-AOS2050.
Frank, M. J. 1979. “On the Simultaneous Associativity of f(x, y) and x + y - f(x, y).” Aequationes Mathematicae 19: 194–226. http://eudml.org/doc/136825.
Frees, Edward W., and Emiliano A. Valdez. 1998. “Understanding Relationships Using Copulas.” North American Actuarial Journal 2 (1): 1–25.
Gaetan, Carlo, and Xavier Guyon. 2008. Modélisation et statistique spatiales. Mathématiques & Applications. Berlin, Heidelberg, New York: Springer.
Genest, C., K. Ghoudi, and L.-P. Rivest. 1995. “A Semiparametric Estimation Procedure of Dependence Parameters in Multivariate Families of Distributions.” Biometrika 82 (3): 543–52. http://www.jstor.org/stable/2337532.
Genest, Christian, and Johan Segers. 2009. “Rank-Based Inference for Bivariate Extreme-Value Copulas.” The Annals of Statistics 37 (5B): 2990–3022.
Gijbels, Irène, Marek Omelka, and Noël Veraverbeke. 2015. “Estimation of a Copula When a Covariate Affects Only Marginal Distributions.” Scandinavian Journal of Statistics 42 (4): 1109–26. http://www.jstor.org/stable/24586878.
Gonçalves, André R., Fernando J. Von Zuben, and Arindam Banerjee. 2016. “Multi-Task Sparse Structure Learning with Gaussian Copula Models.” Journal of Machine Learning Research 17 (1): 1205–34.
Gudendorf, Gordon, and Johan Segers. 2010. “Extreme-Value Copulas.” In Copula Theory and Its Applications, 198:127–45. Lect. Notes Stat. Proc. Springer, Heidelberg. https://doi.org/10.1007/978-3-642-12465-5_6.
Gumbel, E. J. 1960. “Distributions de valeurs extrêmes en plusieurs dimensions.” Publications de l’Institut de Statistique de l’Université de Paris 9: 171–73.
Hamori, Shigeyuki, Kaiji Motegi, and Zheng Zhang. 2019. “Calibration Estimation of Semiparametric Copula Models with Data Missing at Random.” Journal of Multivariate Analysis 173: 85–109. https://doi.org/10.1016/j.jmva.2019.02.003.
Hofert, Marius, Martin Mächler, and Alexander J. McNeil. 2012. “Likelihood Inference for Archimedean Copulas in High Dimensions Under Known Margins.” Journal of Multivariate Analysis 110: 133–50.
Hüsler, Jürg, and Rolf-Dieter Reiss. 1989. “Maxima of Normal Random Vectors: Between Independence and Complete Dependence.” Statistics & Probability Letters 7 (4): 283–86. https://doi.org/10.1016/0167-7152(89)90106-5.
Joe, H. 1990. “Families of Min-Stable Multivariate Exponential and Multivariate Extreme Value Distributions.” Statistics & Probability Letters 9: 75–81.
———. 1997. Multivariate Models and Multivariate Dependence Concepts. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis. https://books.google.fr/books?id=iJbRZL2QzMAC.
Jones, Eric, Travis Oliphant, and Pearu Peterson. 2001. “SciPy: Open Source Scientific Tools for Python.” http://www.scipy.org.
Lopez-Paz, David, José Miguel Hernández-Lobato, and Zoubin Ghahramani. 2013. “Gaussian Process Vine Copulas for Multivariate Dependence.” In International Conference on Machine Learning, 10–18. PMLR.
Marcon, G., S. A. Padoan, P. Naveau, P. Muliere, and J. Segers. 2017. “Multivariate Nonparametric Estimation of the Pickands Dependence Function Using Bernstein Polynomials.” Journal of Statistical Planning and Inference 183: 1–17. https://doi.org/10.1016/j.jspi.2016.10.004.
Marshall, Albert W., and Ingram Olkin. 1988. “Families of Multivariate Distributions.” Journal of the American Statistical Association 83 (403): 834–41. http://www.jstor.org/stable/2289314.
McNeil, Alexander J., and Johanna Nešlehová. 2009. “Multivariate Archimedean Copulas, d-Monotone Functions and \ell_1-Norm Symmetric Distributions.” The Annals of Statistics 37 (5B): 3059–97. https://doi.org/10.1214/07-AOS556.
Naveau, Philippe, Armelle Guillou, Daniel Cooley, and Jean Diebolt. 2009. “Modelling Pairwise Dependence of Maxima in Space.” Biometrika 96 (1): 1–17. https://doi.org/10.1093/biomet/asp001.
Nelsen, Roger B. 2007. An Introduction to Copulas. Springer Science & Business Media.
Nolde, Natalia, and Jennifer L. Wadsworth. 2021. “Linking Representations for Multivariate Extremes via a Limit Set.” https://arxiv.org/abs/2012.00990.
Portier, François, and Johan Segers. 2018. “On the Weak Convergence of the Empirical Conditional Copula Under a Simplifying Assumption.” Journal of Multivariate Analysis 166: 160–81. https://doi.org/10.1016/j.jmva.2018.03.002.
Saunders, K., A. Stephenson, and David Karoly. 2021. “A Regionalisation Approach for Rainfall Based on Extremal Dependence.” Extremes 24 (June). https://doi.org/10.1007/s10687-020-00395-y.
Schepsmeier, U., J. Stoeber, E. C. Brechmann, B. Graeler, T. Nagler, T. Erhardt, C. Almeida, et al. 2019. “VineCopula: Statistical Inference of Vine Copulas.” R package, version 2.3.0. https://cran.r-project.org/web/packages/VineCopula/index.html.
Simpson, Emma S., Jennifer L. Wadsworth, and Jonathan A. Tawn. 2021. “A Geometric Investigation into the Tail Dependence of Vine Copulas.” Journal of Multivariate Analysis 184: 104736. https://doi.org/10.1016/j.jmva.2021.104736.
Sklar, Abe. 1959. “Fonctions de répartition à n dimensions et leurs marges.” Publications de l’Institut de Statistique de l’Université de Paris 8: 229–31.
Smith, R. L. 1990. “Extreme Value Theory.” In Handbook of Applicable Mathematics (ed. W. Ledermann), vol. 7, 437–71. Chichester: John Wiley.
Stephenson, Alec. 2003. “Simulating Multivariate Extreme Value Distributions of Logistic Type.” Extremes 6 (1): 49–59.
Sun, Yi, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. 2019. “Learning Vine Copula Models for Synthetic Data Generation.” Proceedings of the AAAI Conference on Artificial Intelligence 33 (01): 5049–57. https://doi.org/10.1609/aaai.v33i01.33015049.
Van der Walt, Stéfan, S. Chris Colbert, and Gaël Varoquaux. 2011. “The NumPy Array: A Structure for Efficient Numerical Computation.” Computing in Science and Engineering 13 (2): 22–30. https://doi.org/10.1109/MCSE.2011.37.
Veeramachaneni, Kalyan, Alfredo Cuesta-Infante, and Una-May O’Reilly. 2015. “Copula Graphical Models for Wind Resource Estimation.” In IJCAI.
Wooldridge, Jeffrey M. 2007. “Inverse Probability Weighted Estimation for General Missing Data Problems.” Journal of Econometrics 141 (2): 1281–301. https://doi.org/10.1016/j.jeconom.2007.02.002.
Yan, Jun. 2007. “Enjoy the Joy of Copulas: With a Package copula.” Journal of Statistical Software 21 (4): 1–21. https://www.jstatsoft.org/v21/i04/.
6 Appendix
6.1 Bivariate Archimedean models
6.2 Implemented bivariate extreme models
6.3 Multivariate Archimedean copulae
6.4 Multivariate extreme models
Before giving the main details, we introduce some notation. Let B be the set of all nonempty subsets of \{1,\dots,d\} and B_1 = \{b \in B : |b| = 1\}, where |b| denotes the number of elements in the set b. We write B_{(j)} = \{b \in B : j \in b\}. For d=3, the Pickands dependence function is expressed as
where \boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_4), \boldsymbol{\psi} = (\psi_1, \dots, \psi_4), \boldsymbol{\phi} = (\phi_1, \dots, \phi_4) are all elements of \Delta^3. We take \boldsymbol{\alpha} = (0.4,0.3,0.1,0.2), \boldsymbol{\psi} = (0.1, 0.2, 0.4, 0.3), \boldsymbol{\phi} = (0.6,0.1,0.1,0.2) and \boldsymbol{\theta} = (\theta_1, \dots, \theta_4) = (0.6,0.5,0.8,0.3) as the dependence parameters.
The Dirichlet model is a mixture of m Dirichlet densities, that is,
h(\textbf{w}) = \sum_{k=1}^m \theta_k \frac{\Gamma(\sum_{j=1}^d \sigma_{kj})}{\prod_{j=1}^d \Gamma(\sigma_{kj})} \prod_{j=1}^d w_j^{\sigma_{kj}-1},
with \sum_{k=1}^m \theta_k = 1 and \sigma_{kj} > 0 for k \in \{1,\dots,m\} and j \in \{1, \dots, d\}. Let \mathcal{D} \subset [0, \infty)^{(d-1)\times (d-1)} denote the space of symmetric, strictly conditionally negative definite matrices, that is
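As a direct check of the mixture density above, it can be evaluated pointwise on the simplex. The sketch below is illustrative only, not COPPY code, and the function name is ours.

```python
import numpy as np
from scipy.special import gamma as Gamma

def dirichlet_mixture_density(w, theta, sigma):
    """Evaluate h(w) = sum_k theta_k * Dirichlet(w; sigma_k) on the simplex.
    w: point of the unit simplex (length d); theta: (m,) weights summing
    to one; sigma: (m, d) array of positive Dirichlet parameters."""
    w, theta, sigma = np.asarray(w), np.asarray(theta), np.asarray(sigma)
    dens = 0.0
    for theta_k, sigma_k in zip(theta, sigma):
        norm = Gamma(sigma_k.sum()) / np.prod(Gamma(sigma_k))
        dens += theta_k * norm * np.prod(w ** (sigma_k - 1.0))
    return dens

# A mixture of two flat Dirichlet(1, 1) components is uniform: h = 1.
h = dirichlet_mixture_density([0.3, 0.7], [0.5, 0.5], [[1, 1], [1, 1]])
```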
For any 2 \leq k \leq d, consider m = (m_1, \dots, m_k) with 1 \leq m_1 < \dots < m_k \leq d and define
\Sigma^{(k)}_m = 2 \left( \Gamma_{m_i m_k} + \Gamma_{m_j m_k} - \Gamma_{m_i m_j} \right)_{i,j \neq k} \in [0,\infty)^{(k-1)\times(k-1)}.
Furthermore, let S(\cdot \mid \Sigma^{(k)}_m) denote the survival function of a normal random vector with mean vector \textbf{0} and covariance matrix \Sigma^{(k)}_m. We now define:
h_{km}(\textbf{y}) = \int_{y_k}^\infty S\left( (y_i - z + 2\Gamma_{m_i m_k})_{i=1}^{k-1} \mid \Sigma^{(k)}_m\right) e^{-z}dz
for 2 \leq k \leq d. We denote by \sum^{(k)} the summation over all k-vectors m=(m_1,\dots,m_k) with 1\leq m_1 < \dots < m_k \leq d.
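Since the survival function of a centred normal vector satisfies S(\textbf{x}) = \mathbb{P}\{\textbf{X} \leq -\textbf{x}\} by symmetry, the integral h_{km} can be evaluated by one-dimensional quadrature. The sketch below is our own illustration, not COPPY code; the function name and the choice of `scipy.integrate.quad` are ours.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import multivariate_normal

def h_km(y, gamma_col, Sigma):
    """Evaluate h_{km}(y) by quadrature. y: length-k point (y_k last);
    gamma_col: the k-1 terms Gamma_{m_i m_k}; Sigma: (k-1, k-1) covariance.
    Uses S(x) = cdf(-x), valid for a centred normal vector by symmetry."""
    y, gamma_col = np.asarray(y, float), np.asarray(gamma_col, float)
    law = multivariate_normal(mean=np.zeros(len(Sigma)), cov=Sigma)
    integrand = lambda z: law.cdf(-(y[:-1] - z + 2.0 * gamma_col)) * np.exp(-z)
    val, _ = quad(integrand, y[-1], np.inf)
    return val

# Sanity check with k = 2 and Gamma term 0: the integrand reduces to
# Phi(z) * exp(-z) over (0, infinity).
val = h_km([0.0, 0.0], [0.0], [[1.0]])
```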
6.5 Multivariate elliptical dependencies
Let \textbf{X} \sim \textbf{E}_d(\boldsymbol{\mu}, \Sigma, \psi) be an elliptically distributed random vector with cumulative distribution function F and margins F_0, \dots, F_{d-1}. Then the copula C of F is called an elliptical copula. We denote by \phi the standard normal distribution function and by \boldsymbol{\phi}_\Sigma the joint distribution function of \textbf{X} \sim \mathcal{N}_d(\textbf{0}, \Sigma), where \textbf{0} is the d-dimensional vector of zeros. In the same way, we denote by t_{\theta} the distribution function of a standard univariate t distribution with \theta degrees of freedom and by \boldsymbol{t}_{\theta, \Sigma} the joint distribution function of the vector \textbf{X} \sim \boldsymbol{t}_{d}(\theta, \textbf{0}, \Sigma). A d \times d matrix \Sigma is said to be positive semi-definite if for all u \in \mathbb{R}^d:
u^\top \Sigma u \geq 0.
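The condition above can be verified through the spectrum of \Sigma, and a valid \Sigma is all that is needed to obtain Gaussian-copula samples by pushing a normal sample through the standard normal distribution function margin by margin. The following sketch uses our own naming, not the COPPY API:

```python
import numpy as np
from scipy.stats import norm

def is_positive_semidefinite(Sigma, tol=1e-10):
    """u^T Sigma u >= 0 for all u iff every eigenvalue of Sigma is >= 0."""
    return bool(np.all(np.linalg.eigvalsh(np.asarray(Sigma)) >= -tol))

def sample_gaussian_copula(n, Sigma, seed=None):
    """Push a N(0, Sigma) sample through the standard normal distribution
    function, margin by margin, to obtain uniform margins with Gaussian
    dependence."""
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma, size=n)
    return norm.cdf(x)

Sigma = [[1.0, 0.5], [0.5, 1.0]]
u = sample_gaussian_copula(2000, Sigma, seed=0)
```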
Source Code
---title: "Generate random number from copula : a COPPY module"date: "Last build: march 2022"author: - Alexis Boulin, <aboulin@unice.fr>, Laboratoire J.A. Dieudonné, UMR CNRS 7351, Université Côte d'Azur, Nice, Francegithub: https://github.com/Aleboul/COPPYbibliography: references.bibrepo: "COPPY"jupyter: python3---[](https://github.com/Aleboul/article)[](https://mybinder.org/v2/gh/computorg/{{< meta repo >}}/gh-pages?filepath=content.ipynb)[](http://creativecommons.org/licenses/by/4.0/)#### Abstract {.unnumbered}The modeling of dependence between random variables is an important subject in several applied fields of science. To this aim the copula function can be used as a margin-free description of the dependence structure. Several copulae belong to specific families such as Archimedean, Elliptical or Extreme. While software implementation of copulae has been thoroughly explored in R software, methods to work with copula in $\textsf{Python}$ are still in their infancy. To promote the dependence modeling with copula in $\textsf{Python}$, we have developed **COPPY**, a library that provides a range of random vector generation vector for copulae.#### Keywords {.unnumbered}Random number generation, copulas# IntroductionModeling dependence relations between random variables is a topic of interest in probability theory and statistics. The most popular approach is based on the second moment of the underlying random variables namely, the covariance. It is well known that only linear dependence can be captured by the covariance and it is characterizing only for a few models, \emph{e.g.} the multivariate normal distribution. As a beneficial alternative of dependence, the concept of copulae going back to @Skla59. 
The copula $C : [0,1]^d \rightarrow [0,1]$ of a random vector $\textbf{X} = (X_0, \dots, X_{d-1})$ with $d \geq 2$ allows us to separate the effect of dependence from the effect of the marginal distribution such as:$$ \mathbb{P}\left\{ X_0 \leq x_0, \dots, X_{d-1} \leq x_{d-1} \right\} = C\left(\mathbb{P} \{X_0 \leq x_0\}, \dots, \mathbb{P}\{X_{d-1} \leq x_{d-1} \}\right),$$where $(x_0, \dots, x_{d-1}) \in \mathbb{R}^d$. The main consequence of this identityis that the copula completely characterizes the stochastic dependence between themargins of $\textbf{X}$.In other words, copulae allow to model marginal distributions and dependence structure separately. Furthermore, motivated by Sklar's theorem, the problem of investigating stochastic dependences is being reduced to the problem of investigating multivariate distribution function on the unit hypercube $[0,1]^d$ with uniform margins. The theory of copulae has been of prime interest for many applied fields of science, such as quantitative finance, actuarial or environmental sciences. This increasing number of applications has led to a demand for statistical methods. To name some, semiparametric estimation @10.2307/2337532, nonparametric estimation \cite{10.2307/3318798} of copulae where investigated while @10.2307/24586878, @PORTIER2018160analyse the weak convergence of the nonparametric estimation of conditional copulae. Such results are established under a fixed arbitrary fixed dimension $d \geq 2$ but several investigations (see @10.2307/25463423, @10.1214/21-AOS2050) are done for functional data for the tail copula which capture the dependence in the upper tail. 
The empirical version of this function is shown to converge weakly to a Gaussian process which is similar in structure to the weak limit of the empirical copula process.While software implementation of copulas has been thoroughly explored in $\textsf{R}$software see, \emph{e.g.}, @evdR, @copulaR, @VineCopulaR, methods to work with copulain $\textsf{Python}$ are still in their infancy. As far as we know, copulas devotedmodule in $\textsf{Python}$ are mainly designed for modeling (see @CopulasPy, @CopulaePy).These modules use maximum likelihood methods to estimate parametrically the copulafrom observed data and propose to sample synthetic data through the estimated copulamodel. Nevertheless, it does not allow to generate random numbers from a given copulamodel where the user just have to specify the desired parameter. Furthermore, onlyarchimedean and elliptical families are investigated and the multivariate case hasbarely been touched. Here, we propose to implement a wide range of copulas whichinclude the extreme value class and, if possible, in arbitrary fixed dimension$d \geq 2$.Through this article we adopt the following notational conventions such as all the indices will start at $0$ as in $\textsf{Python}$. Consider $(\Omega, \mathcal{A}, \mathbb{P})$ and let $\textbf{X} = (X_0, \dots, X_{d-1})$ be $d$-dimensional random vector with values in $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$, with $d \geq 2$. This random vector has a joint distribution $F$ with copula $C$ and its margins are denoted by $F_j(x) = \mathbb{P}\{X_j \leq x\}$ for all $x \in \mathbb{R}$ and $j \in \{0, \dots, d-1\}$. Denote by $\textbf{U} = (U_0, \dots, U_{d-1})$ a $d$ random vector with copula $C$ and uniform margins. All bold letters $\textbf{x}$ will denote a vector.The module **COPPY** is implemented using the object-oriented features of the $\textsf{Python}$ language. The classes are designed for Archimedean, elliptical, extreme value copulae. 
@sec-classes briefly presents the classes defined in the module. @sec-rng describes methods to generaterandom vectors. @sec-pairwise presents an application of the **COPPY** in the field of modelization of pairwise dependence between maxima. @sec-discussion discusses some further improvements of the module and concludes. @sec-bv_arch - @sec-mv_ellip are devoted to define and illustrate all the parametric copula models implemented in the module.# Classes {#sec-classes}:::{#fig-diagram}Object diagram that structure the code. The **Multivariate** class is the deepest root and serves to instantiate all its childs **Archimedean**, **Extreme**, **Gaussian**, **Student** in red. Grandchilds corresponds to some parametric copula models and are depicted in blue. Examples of methods are the great-grandchild colored in green.:::Three main classes are defined in the **COPPY** module : **Multivariate**, **Archimedean** and **Extreme** classes (see @fig-diagram for a description of the architecture of the code). The **Multivariate** class is designed to define multivariate copula (including the bivariate case) and is located at the highest level of the code architecture. As a generic class, it contains methods to instantiate the copula object or to sample from a copula with desired margins using inversion methods for example. At a second level, we may find its childrens, namely **Archimedean**, **Extreme** and **Gaussian** or **Student**. These are copulae which belong to different families of copulae, namely Archimedean and extreme value copula for **Archimedean** and **Extreme** and elliptical copula for **Gaussian** and **Student**. This partition is relevant theoretically as Archimedean and extreme value copulae are studied on their own (see @charpentier2009 or @genest2009rank to name a few) but also in practice as they contain effective sampling methods. 
However, **Gaussian** and **Student** classes are split as the most effective sampling algorithms are specific for each and cannot be generalized in a broader elliptical class. At the third level of the architecture, we may find some important Archimedean and extreme value copulae parametric models (depicted as blue in @fig-diagram). These parametric models contain methods such as the generator function $\varphi$ (see @sec-Arch) for the Archimedean and the Pickands dependence function $A$ (see @sec-Extreme) for the extreme value (in green in @fig-diagram). We recall in @sec-Arch the definition of Archimedean copulae and some of its properties in high dimensional spaces. A characterization of extreme value copulae is given in @sec-Extreme. For each section, the reader is referred to the to discover the wide range of copulae implemented in the module as @sec-bv_arch and @sec-mv_arch for Archimedean copulae, @sec-bv_ext and @sec-mv_ext for extreme value copulae and lastly @sec-mv_ellip} for Elliptical copulae.## The Archimedean class {#sec-arch}Let $\varphi$ be a generator which is a strictly decreasing, convex function from $[0,1]$ to $[0, \infty]$ such that $\varphi(1) = 0$ and $\varphi(0) = \infty$. We denote by $\varphi^\leftarrow$ the generalized inverse of $\varphi$. Let denote by$$ C(\textbf{u}) = \varphi^\leftarrow (\varphi(u_0)+ \dots + \varphi(u_{d-1})).$$ {#eq-arch_cop}If this relation holds and $C$ is a copula function, then $C$ is called an Archimedean copula. A characterization for @eq-arch_cop to be a copula is that the generator needs to be a d-monotonic function, \emph{i.e.} $\varphi$ is differentiable there up to the order $d$ and the derivatives satisfy$$ (-1)^k \left(\varphi\right)^{(k)}(x) \geq 0, \quad k \in \{1, \dots, d\}$$ {#eq-dmono}for $x \in (0, \infty)$ (see corollary 2.1 of @10.1214/07-AOS556). 
It is peculiarly interesting to emphasize that $d$-monotone Archimedean inverse generators do not necessarily generate Archimedean copulae in dimensions higher than $d$. To that extent, some Archimedean subclasses are only implemented for the bivariate case as in a greater dimension, they do not generate an Archimedean copula. In the bivariate case, it is worth noticing that @eq-dmono can be interpreted as $\varphi$ be a convex function.Implemented Archimedean copula classes in the module are commonly used one-parameter families, such as Clayton (@10.2307/2335289), Gumbel (@1960), Joe (@joe1997multivariate), Frank @Frank1979 and AMH (@ALI1978405) copulae for the multivariate case. Remark that every Archimedean copulae are always symmetric and in dimension $3$ or higher only positive association are allowed. For the specific bivariate case, other families such as numbers from 4.2.9 to 4.2.15 and 4.2.22 of Section $4.2$ of @nelsen2007introduction are implemented. We refer the reader to @sec-bv_arch and @sec-mv_arch for definitions and illustrations of these parametric copula models.## The Extreme class {#sec-extreme}Investigating the notion of copulae within the framework of multivariate extreme value theory leads to the so-called extreme value copulae (see @gudendorf2009extremevalue for an overview) defined as $$C(\textbf{u}) = \exp \left( - \ell(-\ln(u_0), \dots, -\ln(u_{d-1})) \right), \quad \textbf{u} \in (0,1]^d,$$ {#eq-evc}with $\ell : [0,\infty)^d \rightarrow [0,\infty)$ the stable tail dependence function which is convex, homogeneous of order one, namely $\ell(c\textbf{x}) = c \ell(\textbf{x})$ for $c > 0$ and satisfies $\max(x_0,\dots,x_{d-1}) \leq \ell(x_0,\dots,x_{d-1}) \leq x_0+\dots+x_{d-1}, \forall \textbf{x} \in [0,\infty)^d$. Denote by $\Delta^{d-1} = \{\textbf{w} \in [0,1]^d : w_0 + \dots + w_{d-1} = 1\}$ the unit simplex. 
By homogeneity, $\ell$ is characterized by the \emph{Pickands dependence function} $A : \Delta^{d-1} \rightarrow [1/d,1]$, which is the restriction of $\ell$ to the unit simplex $\Delta^{d-1}$ :$$ \ell(x_0, \dots,x_{d-1}) = (x_0 + \dots + x_{d-1}) A(w_0, \dots, w_{d-1}), \quad w_j = \frac{x_j}{x_0 + \dots + x_{d-1}},$${#eq-tail_dependence_pickands}for $j \in \{1,\dots,d-1\}$ and $w_0 = 1 - w_1 - \dots - w_{d-1}$ with $\textbf{x} \in [0, \infty)^d \setminus \{\textbf{0}\}$.From a practical point of view, the family of extreme value copulae is very rich and arises naturally as the limiting distribution of properly normalised component-wise maxima. Furthermore, it contains a rich variety of parametric models and allows asymmetric dependence. For the multivariate framework, the logistic copula (or Gumbel, see @1960) and the asymmetric logistic copula (@tawn1990) are implemented. We emphasize here that the logistic copula is the sole model that is both Archimedean and extreme value. Bivariate extreme value copulae which are included in the library are asymmetric logistic (@10.1093/biomet/75.3.397), asymmetric negative logistic (@Joe1990FamiliesOM), asymmetric mixed (@10.1093/biomet/75.3.397), Hüsler and Reiss (@HUSLER1989283), the t-EV (@Demarta_Mcneil),Bilogistic (@Smith1990), Dirichlet (@BoldiDavison2007) models. The reader is again invited to the @sec-bv_ext and @sec-mv_ext for precise definitions of these models.# Random number generator {#sec-rng}We propose a $\textsf{Python}$-based implementation for generate random number from a wide variety of copula. **COPPY** requires some external packages in order to work. They are very few are freely available online.- `NumPy` version 1.6.1 or newer. This is the fundamental package for scientific computing, it contains linear algebra functions and matrix / vector objects (@vanderwalt:inria-00564007).- `SciPy` version 1.7.1 or newer. 
A library of open-source softwarefor mathematics, science and engineering (@jones_scipy:_2001).The random vector generator methods in the code are *sample_unimargin* and *sample* for the **Multivariate** class. The first one generates a sample where margins are uniformly distributed on the unit segment $[0,1]$ while the second from the chosen margins. In @sec-biv_case, we present an algorithm used to sample from a copula using the conditioning method. This method is very general and may be used for every copula that is sufficiently smooth (see @eq-partial_deriv_arch, @eq-partial_deriv_pick and @eq-cond_dist_mv below). However, the practical infeasibility of the algorithm in a dimension higher than $2$ and the numerical inversion which is computationally intensive call for efficient ways to sample in greater dimensions. The purpose of @sec-mv_case is to present those methods and details those used in the module. In each Section, typical lines of codes are presented to instantiate a copula and to sample with **COPPY**.In the sequel, all $\textsf{Python}$ code demonstration assumes that the modules have been loaded :```{python}from coppy.rng import base, evd, archimedeanimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snsplt.style.use('seaborn-whitegrid')```## The bivariate case {#sec-biv_case}In this subsection, we will address the problem of generating a bivariate sample from a specified joint distribution with $d = 2$. Suppose that we want to sample a bivariate random vector $\textbf{X}$ with copula $C$. If the components are independents, the procedure is straightforward as we can sample in one hand from $X_0$ and from $X_1$ on the other hand. In the case of copula, this case is only a special one.One procedure to generate a pair $(u_0,u_1)$ from $\textbf{U}$ is the conditional distribution method. This method is convenient if the law of $U_1$ given $U_0$ can be easily sampled. 
An interesting property of copulae is the link between the conditional distribution and the derivative of the copula function (see @nelsen2007introduction, page 41), namely$$ c_{u_0}(u_1) \triangleq \mathbb{P}\left\{ U_1 \leq u_1 | U_0 = u_0 \right\} = \frac{\partial C(u_0,u_1)}{\partial u_0}.$$ {#eq-cond_sim}Thus, a convenient algorithm to simulate bivariate copula as long as the copula admits a first partial derivative with respect to its first component is given by Algorithm 1.::: {#fig:alg_1}:::For the sixth step of Algorithm 1, it is equivalent to find $u_1 \in [0,1]$ such that $c_{u_0}(u_1) - t_1 = 0$ holds. This $u_1$ always exists because for every $u \in ]0,1[$, $ 0 \leq c_{u_0}(u) \leq 1$ and that $u \mapsto c_{u_0}(u)$ is an increasing function (see Theorem 2.2.7 of \emph{op. cit.} for a proof). This step is solved using the \texttt{brentq} function from `SciPy` module. A sufficient condition in order that $C$ admits a first partial derivative in the Archimedean and extreme value case is that the generator $\varphi$ and the Pickands dependence function $A$ are continuously differentiable on $]0,1[$. Then the first partial derivatives of $C$ are given respectively by :$$ \frac{\partial C}{\partial u_0}(u_0,u_1) = \frac{\varphi'(u_0)}{\varphi'(C(u_0,u_1))}, \quad (u_0,u_1) \in ]0,1[^2,$$ {#eq-partial_deriv_arch}$$ \frac{\partial C}{\partial u_0}(u_0,u_1) = \frac{\varphi'(u_0)}{\varphi'(C(u_0,u_1))}, \quad (u_0,u_1) \in ]0,1[^2,$$ {#eq-partial_deriv_pick}where $t = log(u_1) / log(u_0u_1) \in (0,1)$ and $\mu(t) = A(t) - tA'(t)$. We now have all the necessary theoretical tools to give details on how the **COPPY** module is designed. The file *base.py* contains the **Multivariate** class where the *sample* method sample from $\textbf{X}$ with copula $C$. To do so, we use the inversion method that is to sample from $\textbf{U}$ using Algorithm 1 and we compose the corresponding uniform margins by $F_j^\leftarrow$. 
@eq-cond_sim indicate that the sole knowledge of $A$ and $\varphi$ and their respective derivatives is needed in order to perform the sixth step of Algorithm 1. For that purpose, Algorithm 1 named as *cond_sim* in the code and is located inside the **Archimedean** and **Extreme** class. Then each child of the bivariate **Archimedean** (\emph{resp.} **Extreme**) class is thus defined by its generator $\varphi$ (resp. $A$), it's derivative $\varphi'$ (\emph{resp.} $A'$) and it's inverse $\varphi^\leftarrow$ as emphasized in greed in @fig-diagram. Namely, we perform Algorithm 1 for the **Archimedean** subclasses *Frank*, *AMH*, *Clayton* (when $\theta < 0$ for the previous three), *Nelsen_9*, *Nelsen_10*, *Nelsen_11*, *Nelsen_12*, *Nelsen_13*, *Nelsen_14*, *Nelsen_15* and *Nelsen_22*. For the **Extreme** class, such algorithm is performed for the *Asy_neg_log* and *Asy_mixed* models. For other models, faster algorithms are known and thus implemented, we refer to @sec-mv_case for details.The following code illustrates the random vector generation for a bivariate Archimedean copula. 
By defining the parameter of the copula and the sample's length, constructor of this copulae are available and might be call using $Clayton()$ method such as :```{python} n_sample, theta =1024, -0.5 copula = archimedean.Clayton(theta = theta, n_sample = n_sample)```To obtain a sample with uniform margins and Clayton copula, we make use of the $\texttt{sample\_unimargin}$ as ```{python} sample = copula.sample_unimargin()```Here's, the $\texttt{sample}$ object is a $\texttt{Numpy}$ array with $2$ columns and $1024$ rows where each row contains a realization from a Clayton copula ```{python} seagreen = sns.light_palette("seagreen", as_cmap =True) fig, ax = plt.subplots() ax.scatter(sample[:,0], sample[:,1], color ='white', edgecolor = seagreen(0.25),s =5) ax.set_xlabel(r'$u_0$') ax.set_ylabel(r'$u_1$') plt.show()```## The multivariate case {#sec-mv_case}We will now address the simulation of multivariate Archimedean and Extreme value copulae proposed on the $COPPY$ module. In the multivariate case, the link between partial derivatives conditioning remains. Indeed, let $(U_0, \dots, U_{d-1})$ be a $d$-dimensional random vector with uniform margins and copula $C$, the conditional distribution of $U_k$ given the values of $U_0, \dots, U_{k-1}$ is$$ \mathbb{P}\left\{ U_k \leq u_k | U_0 = u_0, \dots, U_{k-1} = u_{k-1} \right\} = \frac{\partial^{k-1} C(u_0, \dots, u_k,1,\dots,1)/\partial u_0 \dots \partial u_{k-1}}{\partial^{k-1} C(u_0, \dots, u_{k-1},1,\dots,1) / \partial u_0 \dots \partial u_{k-1}}$$ {#eq-cond_dist_mv}for $k \in \{1,\dots, d-1\}$. The conditional simulation algorithm may be written as follows.1. Generate $d$ independent uniform random on $[0,1]$ variates $v_0, \dots, v_{d-1}$.2. Set $u_0 = v_0$.3. For $k = 1, \dots, d-1$, evaluate the inverse of the conditional distribution given by @eq-cond_dist_mv at $v_k$, to generate $u_k$.Nevertheless, evaluation of the inverse conditional distribution becomes increasingly complicated as the dimension $d$ increases. 
Furthermore, for some models it can be difficult to derive a closed form of @eq-cond_dist_mv, which makes it impossible to implement in a general algorithm where only the dimension $d$ is an input. For multivariate Archimedean copulae, @10.1214/07-AOS556 give a method to generate a random vector from the $d$-dimensional copula $C$ with generator $\varphi$ (see Section 5.2 of *loc. cit.*). A stochastic representation for Archimedean copulae generated by a $d$-monotone generator is given by

$$\textbf{U} = \left( \varphi^\leftarrow(R S_1), \dots, \varphi^\leftarrow(RS_d) \right) \sim C,$$ {#eq-radial}

where $R \sim F_R$, the radial distribution, is independent of $S$, and $S$ is distributed uniformly on the unit simplex $\Delta^{d-1}$. One challenging aspect of this algorithm is to evaluate accurately the radial distribution of the Archimedean copula and then to invert it numerically. The radial distribution associated to the $\textsf{Clayton}$ copula is given in Example 3.3 of the paper cited above, while those of the $\textsf{Joe}$, $\textsf{AMH}$, $\textsf{Gumbel}$ and $\textsf{Frank}$ copulae are given in @hofert2012likelihood. In general, one can use numerical inversion algorithms to compute the inverse of the radial distribution, but this leads to spurious numerical errors. Other algorithms exist when the generator is known to be the Laplace-Stieltjes transform, denoted $\mathcal{LS}$, of some positive random variable (see @10.2307/2289314, @frees1998understanding). This positive random variable is often referred to as the frailty distribution. In this framework, Archimedean copulae allow for the stochastic representation

$$ \textbf{U} = \left( \varphi^\leftarrow (E_1/V), \dots, \varphi^\leftarrow(E_d /V)\right) \sim C,$$

with $V \sim F = \mathcal{LS}^{-1}[\varphi^\leftarrow]$ the frailty and $E_1, \dots, E_d$ *i.i.d.* standard exponential random variables independent of $V$.
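As an illustration of this representation, the Clayton inverse generator $(1+t)^{-1/\theta}$ (one valid normalization of $\varphi^\leftarrow$) is the Laplace-Stieltjes transform of a $\Gamma(1/\theta, 1)$ frailty, which yields the following stand-alone sketch (assuming $\theta > 0$; not COPPY's $\texttt{\_frailty\_sim}$):

```python
import numpy as np

def clayton_frailty_sim(theta, n_sample, d=2, seed=None):
    """Frailty sampler for the d-dimensional Clayton copula (theta > 0).

    V ~ Gamma(1/theta) is the frailty; its Laplace-Stieltjes transform
    (1 + t)**(-1/theta) plays the role of the inverse generator.
    """
    rng = np.random.default_rng(seed)
    V = rng.gamma(shape=1.0 / theta, scale=1.0, size=(n_sample, 1))
    E = rng.exponential(size=(n_sample, d))      # i.i.d. standard exponentials
    return (1.0 + E / V) ** (-1.0 / theta)       # phi_inv(E_j / V), columnwise
```

Since $V$ is shared across coordinates, the sampler costs $O(n d)$ draws and no numerical inversion.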
The algorithm to sample from this representation is thus:

::: {#fig:alg_2}
:::

In this framework, we define the $\texttt{\_frailty\_sim}$ method inside the $\textbf{Archimedean}$ class, which performs Algorithm 2. Each Archimedean copula whose frailty distribution is known is thus defined by the generator $\varphi$, its inverse $\varphi^\leftarrow$ and the frailty distribution $\mathcal{LS}^{-1}[\varphi^\leftarrow]$. This is the case for $\texttt{Joe}$, $\texttt{Clayton}$, $\texttt{AMH}$ and $\texttt{Frank}$.

For the extreme value case, algorithms have been proposed by @stephenson2003simulating (see Algorithms 2.1 and 2.2), who gives sampling methods for the Gumbel and the asymmetric logistic models. These algorithms are implemented in the **COPPY** module. Note that these algorithms are model-specific; the $\texttt{sample\_unimargin}$ method is thus exceptionally located in the corresponding child of the multivariate **Extreme** class. Another procedure, designed by @10.1093/biomet/asw008 to sample from multivariate extreme value models using extremal functions (see Algorithm 2 of the cited reference), is also of prime interest. For the models implemented with this algorithm, namely the Hüsler-Reiss, tEV, Bilogistic and Dirichlet models, a method called $\texttt{rExtFunc}$ is located inside each class, which generates an observation from the law of the corresponding extremal function.

Samples from the Gaussian and Student copulae are directly given by Algorithms 5.9 and 5.10, respectively, of @quantrisk. As each algorithm is model-specific, the $\texttt{sample\_unimargin}$ method is located inside the **Gaussian** and **Student** classes.

We now present how to construct a multivariate Archimedean copula and generate random vectors from this model.
Given the parameters of the copula, we use the following lines to construct our copula object:

```{python}
d, theta, n_sample = 3, 2.0, 1024
copula = archimedean.Clayton(theta = theta, n_sample = n_sample, d = d)
```

We now call the $\texttt{sample\_unimargin}$ method to obtain randomly generated vectors.

```{python}
sample = copula.sample_unimargin()
```

We represent these realizations in three dimensions below.

```{python}
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
ax.scatter3D(sample[:,0], sample[:,1], sample[:,2], s = 5, edgecolor = seagreen(0.5), color = 'white')
ax.set_xlabel(r'$u_0$')
ax.set_ylabel(r'$u_1$')
ax.set_zlabel(r'$u_2$')
plt.show()
```

# Case study : Modeling pairwise dependence between spatial maxima with missing data {#sec-pairwise}

We now proceed to a case study where we use our \textsf{Python} module to assess, in a finite sample framework, the asymptotic properties of an estimator of the $\lambda$-madogram when data are missing completely at random (MCAR). This case study comes from the numerical results of @boulin2021non. The $\lambda$-madogram belongs to a family of estimators, the madogram, which is of prime interest in environmental sciences, as it is designed to model pairwise dependence between maxima in space; see *e.g.* @bernard:hal-03207469, @BADOR201517 and @saunders, where the madogram was used as a dissimilarity measure to perform clustering. While in several fields, for example econometrics (@woolridge2007) or survey theory (@chauvet2015), the MCAR hypothesis is a strong one, it is more realistic in environmental research, as the missingness of one observation is usually due to instrument, communication and processing errors that may reasonably be supposed independent of the quantity of interest.
In @sec-background, we define the objects and properties of interest, while in @sec-num we provide a detailed tutorial in $\textsf{Python}$ with the **COPPY** module to compare the asymptotic variance with an empirical counterpart of the $\lambda$-madogram for $\lambda = 0.5$.

## Background {#sec-background}

It was emphasized that the possible dependence between maxima can be described with the extreme value copula. This function is completely characterized by the Pickands dependence function (see @eq-tail_dependence_pickands), which is in turn linked to the $\lambda$-madogram introduced by @naveau:hal-00312758 and defined as

$$ \nu(\lambda) = \frac{1}{2}\mathbb{E}\left[ \left|\{F_0(X_0)\}^{1/\lambda} - \{F_1(X_1)\}^{1/(1-\lambda)} \right|\right],$$ {#eq-lmbd_mado}

with $\lambda \in (0,1)$, and if $\lambda = 0$ and $0<u<1$, then $u^{1/\lambda} = 0$ by convention. This quantity draws its inspiration from an extensively used geostatistical tool, the variogram (see Chapter 1.3 of @alma991005826659705596 for a definition and some classical properties). The $\lambda$-madogram can be interpreted as the $L_1$-distance between the uniform margins raised to the inverse of the corresponding weights $\lambda$ and $1-\lambda$. It describes the dependence structure between extremes through its relation with the Pickands dependence function: if we suppose that $C$ is an extreme value copula as in @eq-evc, we have

$$ A(\lambda) = \frac{\nu(\lambda) + c(\lambda)}{1-\nu(\lambda) - c(\lambda)},$$ {#eq-pickands_mado}

with $c(\lambda) = 2^{-1} \left(\lambda / (1+\lambda) + (1-\lambda)/(2-\lambda)\right)$ (see Proposition 3 of @MARCON20171 for details). As a sanity check, under independence $\nu(1/2) = 1/6$ and $c(1/2) = 1/3$, so @eq-pickands_mado gives $A(1/2) = 1$ as expected.

We consider *i.i.d.* copies $\textbf{X}_1, \dots, \textbf{X}_n$ of $\textbf{X}$. In the presence of missing data, we do not observe a complete vector $\textbf{X}_i$ for $i \in \{1,\dots,n\}$.
We introduce $\textbf{I}_i \in \{0,1\}^2$ which satisfies, for all $j \in \{0,1\}$, $I_{i,j} = 0$ if $X_{i,j}$ is not observed. To formalize incomplete observations, we introduce the incomplete vector $\tilde{\textbf{X}}_i$ with values in the product space $\bigotimes_{j=1}^2 (\mathbb{R} \cup \{\textsf{NA}\})$ such that

$$ \tilde{X}_{i,j} = X_{i,j} I_{i,j} + \textsf{NA} (1-I_{i,j}), \quad i \in \{1,\dots,n\}, \, j \in \{0,1\}.$$

We thus suppose that we observe the tuples

$$ (\textbf{I}_i, \tilde{\textbf{X}}_i), \quad i \in \{1,\dots,n\},$$ {#eq-missing_2}

where the $\textbf{I}_{i}$ are *i.i.d.* copies of $\textbf{I} = (I_0, I_1)$ and $I_j$ is distributed according to a Bernoulli random variable $\mathcal{B}(p_j)$ with $p_j = \mathbb{P}(I_j = 1)$ for $j \in \{0,1\}$. We denote by $p$ the probability of completely observing a realization of $\textbf{X}$, that is $p = \mathbb{P}(I_0=1, I_1 = 1)$. In @boulin2021non, hybrid and corrected estimators, respectively denoted $\hat{\nu}_n^{\mathcal{H}}$ and $\hat{\nu}_n^{\mathcal{H*}}$, are proposed to estimate nonparametrically the $\lambda$-madogram in the presence of data missing completely at random. Furthermore, a closed expression of their asymptotic variances for $\lambda \in ]0,1[$ is also given. This result is summarized in the following proposition.

::: {#thm-line}

## @boulin2021non

Let $(\textbf{I}_i, \tilde{\textbf{X}}_i)_{i=1}^n$ be a sample given by @eq-missing_2.
For $\lambda \in ]0,1[$, if $C$ is an extreme value copula with Pickands dependence function $A$, we have as $n \rightarrow \infty$

\begin{align*} &\sqrt{n} \left(\hat{\nu}_n^{\mathcal{H}}(\lambda) - \nu( \lambda)\right) \overset{d}{\rightarrow} \mathcal{N}\left(0, \mathcal{S}^{\mathcal{H}}(p_1,p_2,p, \lambda)\right), \\ &\sqrt{n} \left(\hat{\nu}_n^{\mathcal{H}*}(\lambda) - \nu( \lambda)\right) \overset{d}{\rightarrow} \mathcal{N}\left(0, \mathcal{S}^{\mathcal{H}*}(p_1,p_2,p, \lambda)\right),\end{align*}

where $\mathcal{S}^{\mathcal{H}}(p_1,p_2,p, \lambda)$ and $\mathcal{S}^{\mathcal{H}*}(p_1,p_2,p, \lambda)$ are the asymptotic variances of the limiting random variables.
:::

## Numerical results {#sec-num}

Benefiting from the data generation facilities of **COPPY**, we are able to assess, through Monte Carlo simulation, the theoretical results given by Theorem 1 in a finite sample setting. For that purpose, we implement a $\texttt{Monte\_Carlo}$ class (in the $\texttt{monte\_carlo.py}$ file) which contains methods to perform Monte Carlo iterations for a given extreme value copula. Before going any further, we have to import the necessary libraries:

```{python}
from coppy.rng import base, evd, archimedean, monte_carlo
from scipy.stats import norm, expon

def gauss_function(x, x0, sigma):
    return np.sqrt(1 / (2 * np.pi * sigma ** 2)) * np.exp(-(x - x0) ** 2 / (2 * sigma ** 2))
```

We then set up the parameters to simulate our bivariate dataset. For this subsection, we choose the asymmetric negative logistic model (see @sec-bv_ext for a definition) with parameters $\theta = 10, \psi_1 = 0.1, \psi_2 = 1.0$.

```{python}
np.random.seed(42)
n_sample = 2048
theta, psi1, psi2 = 10, 0.1, 1.0
```

We choose the standard normal and exponential distributions as margins.
To simulate this sample, the following lines should be typed:

```{python}
copula = evd.Asy_neg_log(theta = theta, psi1 = psi1, psi2 = psi2, n_sample = n_sample)
sample = copula.sample(inv_cdf = [norm.ppf, expon.ppf])
```

The $2048 \times 2$ array $\texttt{sample}$ contains $2048$ realizations of the asymmetric negative logistic model, where the first column is distributed according to a standard normal random variable and the second column according to a standard exponential. This distribution is depicted below; to obtain it, one needs the following lines:

```{python}
seagreen = sns.light_palette("seagreen", as_cmap = True)
fig, ax = plt.subplots()
ax.scatter(sample[:,0], sample[:,1], color = seagreen(0.25), s = 1, marker = '.')
ax.set_xlabel(r'$x_0$')
ax.set_ylabel(r'$x_1$')
plt.show()
```

Before going into further details, we present the missing mechanism. Let $V_0$ and $V_1$ be random variables uniformly distributed on $]0,1[$ with copula $C_{(V_0,V_1)}$. We set $I_0 = 1_{\{V_0 \leq p_0\}}$ and $I_1 = 1_{\{V_1 \leq p_1\}}$. It is thus immediate that $I_0 \sim \mathcal{B}(p_0)$, $I_1 \sim \mathcal{B}(p_1)$ and $p \triangleq \mathbb{P}\{I_0 = 1, I_1 =1 \} = C_{(V_0,V_1)}(p_0, p_1)$. For our illustration, we take $C_{(V_0,V_1)}$ as a $\texttt{Joe}$ copula with parameter $\theta = 2.0$ (we refer to @sec-bv_arch for a representation of this copula). Thereby, if $X_0$ is not observed, it is more likely that $X_1$ is not observed either. *A contrario*, observing $X_0$ barely influences the observation of the other random variable $X_1$. Indeed, for this copula, a realization $v_0 \geq 0.8$ of $V_0$ is more likely when $v_1 \geq 0.8$ for $V_1$, while if $v_1 < 0.8$, the realization $v_0$ is close to independent of $v_1$. In climate studies, extreme events may damage the recording instruments in the surrounding regions where they occur, so the missingness of one variable may depend on the others.
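For intuition, the masking step applied internally during the simulations can be sketched by hand with a hypothetical helper (the names $\texttt{mask\_mcar}$, $\texttt{V}$ and $\texttt{p}$ are ours, not part of COPPY): threshold the uniforms $(V_0, V_1)$ at $(p_0, p_1)$ and hide the corresponding entries.

```python
import numpy as np

def mask_mcar(sample, V, p=(0.9, 0.9)):
    """MCAR masking sketch: I_{i,j} = 1{V_{i,j} <= p_j}; entries with
    I = 0 are replaced by NaN (standing in for NA).  `V` holds the
    uniform margins drawn from the missingness copula C_{(V0,V1)}."""
    I = (V <= np.asarray(p)).astype(int)       # Bernoulli(p_j) indicators
    masked = np.where(I == 1, sample, np.nan)  # NA where not observed
    return masked, I
```

With a dependent $C_{(V_0,V_1)}$, the fraction of fully observed rows is $C_{(V_0,V_1)}(p_0, p_1)$ rather than $p_0 p_1$.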
We initialize the copula $C_{(V_0,V_1)}$ with the following line:

```{python}
copula_miss = archimedean.Joe(theta = 2.0, n_sample = n_sample)
```

For a given $\lambda \in ]0,1[$, we now want to estimate the $\lambda$-madogram from a sample simulated from the asymmetric negative logistic model, where some observations are lost through the missing mechanism described above. We replicate this step several times to compute an empirical counterpart of the asymptotic variance. The $\texttt{Monte\_Carlo}$ object has been designed in this way: we specify the number of iterations $n_{iter}$ (we take $n_{iter} = 1024$), the chosen extreme value copula (the asymmetric negative logistic model), the missing mechanism (described by $C_{(V_0,V_1)}$ and $p_0 = p_1 = 0.9$) and $\lambda$ (denoted $\texttt{w}$). We thus write the following lines:

```{python}
u = np.array([0.9, 0.9])
n_iter, P, w = 1024, [[u[0], copula_miss._C(u)], [copula_miss._C(u), u[1]]], np.array([0.5, 0.5])
monte = monte_carlo.Monte_Carlo(n_iter = n_iter, n_sample = n_sample, copula = copula, copula_miss = copula_miss, w = w, P = P)
```

The $\texttt{Monte\_Carlo}$ object is thus initialized with all the parameters needed. We may use the $\texttt{finite\_sample}$ method to generate a $\texttt{DataFrame}$ (a $\texttt{Pandas}$ object) composed of $1024$ rows and $3$ columns. Each row contains an estimate of the $\lambda$-madogram, $\hat{\nu}_n^{\mathcal{H}*}$ in Theorem 1 ($\texttt{FMado}$), the sample length $n$ ($\texttt{n}$) and the normalized estimation error ($\texttt{scaled}$). We call it as follows:

```{python}
df_wmado = monte.finite_sample(inv_cdf = [norm.ppf, expon.ppf], corr = True)
print(df_wmado.head())
```

Here, $\texttt{corr = True}$ specifies that we compute the corrected estimator $\hat{\nu}_n^{\mathcal{H}*}$ of Theorem 1. Now, using the $\texttt{var\_mado}$ method defined inside the **Extreme** class, we obtain the asymptotic variance for the given model and the parameters of the missing mechanism.
We obtain this quantity as follows:

```{python}
var_mado = copula.var_mado(w, p = copula_miss._C(u), P = P, corr = True)
print(var_mado)
print(df_wmado['scaled'].var())
```

We propose here to check numerically the asymptotic normality, with variance $\mathcal{S}^{\mathcal{H}*}$, of the normalized estimation error of the corrected estimator. We have all the data in hand, and the asymptotic variance was computed above. We thus write:

```{python}
sigma = np.sqrt(var_mado)
x = np.linspace(min(df_wmado['scaled']), max(df_wmado['scaled']), 1000)
gauss = gauss_function(x, 0, sigma)
sns.displot(data = df_wmado, x = "scaled", color = seagreen(0.5), kind = 'hist', stat = 'density', common_norm = False, alpha = 0.5, fill = True, linewidth = 1.5)
plt.plot(x, gauss, color = 'darkblue')
```

# Discussion {#sec-discussion}

This article presents the construction and some implementations of the $\textsf{Python}$ module **COPPY** for random copula sampling. This is a seminal work of software implementation of copula modeling in $\textsf{Python}$, and there is much more to do. We hope that the diffusion of the software to everyone who needs it will bring further implementations for multivariate modeling with copulae in $\textsf{Python}$. For example, choosing a copula to fit the data is an important but difficult problem. A robust approach to estimate copulae has recently been investigated by @alquier2020estimation using the Maximum Mean Discrepancy. In connection with our example, semiparametric estimation of copulae with missing data could be of great interest, as proposed by @HAMORI201985.

Also, implementing the algorithm proposed by @10.1214/07-AOS556 to generate random vectors for Archimedean copulae has been tackled but, as expected, the numerical inversion gives spurious results, especially when the parameter $\theta$ and the dimension $d$ are high. Furthermore, as the support of the radial distribution is contained in the real line, the numerical inversion induces a longer computation time.
Further investigations are thus needed to generate random vectors from classical Archimedean models using the radial distribution.

A direction of improvement for the **COPPY** module is dependence modeling with Vine copulae, which have recently become a tool of high interest in the machine learning community; see, *e.g.*, @lopez2013gaussian, @Veeramachaneni2015CopulaGM, @Carrera2016VineCC, @10.5555/2946645.2946678 or @SunCuesta-InfanteVeeramachaneni2019. This strengthens the need for dependence modeling with copulae in $\textsf{Python}$, as a non-negligible part of the machine learning community uses this language. In connection with this article, Vine copulae might be interesting for modeling dependencies between extremes, as suggested by @SIMPSON2021104736 and @nolde2021linking. Last, other copula models could be implemented to model further dependencies. These implementations will push forward the scope of dependence modeling with $\textsf{Python}$ and give high-quality, usable tools to everyone in need.

# References {.unnumbered}

::: {#refs}
:::

# Appendix {#appendices}

## Bivariate Archimedean models {#sec-bv_arch}

::: {#fig-bv_arch_1}
{width=50%}
:::

::: {#fig-bv_arch_2}
{width=50%}
:::

## Implemented bivariate extreme models {#sec-bv_ext}

::: {#fig-bv_ext}
{width=50%}
:::

## Multivariate Archimedean copulae {#sec-mv_arch}

::: {#fig-mv_arch}
{width=50%}
:::

## Multivariate extreme models {#sec-mv_ext}

Before giving the main details, we introduce some notation. Let $B$ be the set of all nonempty subsets of $\{1,\dots,d\}$ and $B_1 = \{b \in B, |b| = 1\}$, where $|b|$ denotes the number of elements in the set $b$. We denote by $B_{(j)} = \{b \in B, j \in b\}$.
For $d=3$, the Pickands dependence function is expressed as

::: {#eq-mv_ext}
\begin{align*} A(\textbf{w}) =& \alpha_1 w_1 + \psi_1 w_2 + \phi_1 w_3 + \left( (\alpha_2 w_1)^{\theta_1} + (\psi_2 w_2)^{\theta_1} \right)^{1/\theta_1} + \left( (\alpha_3 w_1)^{\theta_2} + (\phi_2 w_3)^{\theta_2} \right)^{1/\theta_2} \\ &+ \left( (\psi_3 w_2)^{\theta_3} + (\phi_3 w_3)^{\theta_3} \right)^{1/\theta_3} + \left( (\alpha_4 w_1)^{\theta_4} + (\psi_4 w_2)^{\theta_4} + (\phi_4 w_3)^{\theta_4} \right)^{1/\theta_4},\end{align*}
:::

where $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_4), \boldsymbol{\psi} = (\psi_1, \dots, \psi_4), \boldsymbol{\phi} = (\phi_1, \dots, \phi_4)$ are all elements of $\Delta^3$. We take $\boldsymbol{\alpha} = (0.4,0.3,0.1,0.2)$, $\boldsymbol{\psi} = (0.1, 0.2, 0.4, 0.3)$, $\boldsymbol{\phi} = (0.6,0.1,0.1,0.2)$ and $\boldsymbol{\theta} = (\theta_1, \dots, \theta_4) = (0.6,0.5,0.8,0.3)$ as the dependence parameters.

The Dirichlet model is a mixture of $m$ Dirichlet densities, that is

$$ h(\textbf{w}) = \sum_{k=1}^m \theta_k \frac{\Gamma(\sum_{j=1}^d \sigma_{kj})}{\Pi_{j=1}^d \Gamma(\sigma_{kj})} \Pi_{j=1}^d w_j^{\sigma_{kj}-1},$$

with $\sum_{k=1}^m \theta_k = 1$ and $\sigma_{kj} > 0$ for $k \in \{1,\dots,m\}$ and $j \in \{1, \dots, d\}$.

Let $\mathcal{D}_{k} \subset [0, \infty)^{k \times k}$ denote the space of symmetric, strictly conditionally negative definite matrices, that is

::: {#striclty_cond_neg}
\begin{align*} \mathcal{D}_{k} = \Big\{ \Gamma \in [0,\infty)^{k \times k} : a^\top \Gamma a < 0 \; \textrm{for all } a \in \mathbb{R}^{k} \setminus \{\textbf{0}\} \, \textrm{with } \sum_{j=1}^{k} a_j = 0, \\ \Gamma_{ii} = 0, \Gamma_{ij} = \Gamma_{ji}, \quad 1 \leq i,j\leq k \Big\}.
\end{align*}
:::

For any $2 \leq k \leq d$, consider $m = (m_1, \dots, m_k)$ with $1 \leq m_1 < \dots < m_k \leq d$ and define

$$ \Sigma^{(k)}_m = 2 \left( \Gamma_{m_i m_k} + \Gamma_{m_j m_k} - \Gamma_{m_i m_j} \right)_{i,j \neq k} \in [0,\infty)^{(k-1)\times(k-1)}.$$

Furthermore, let $S(\cdot \, | \, \Sigma^{(k)}_m)$ denote the survival function of a normal random vector with mean vector $\textbf{0}$ and covariance matrix $\Sigma^{(k)}_m$. We now define

$$ h_{km}(\textbf{y}) = \int_{y_k}^\infty S\left( (y_i - z + 2\Gamma_{m_i m_k})_{i=1}^{k-1} \, | \, \Sigma^{(k)}_m\right) e^{-z}dz $$

for $2 \leq k \leq d$. We denote by $\sum^{(k)}$ the summation over all $k$-vectors $m=(m_1,\dots,m_k)$ with $1\leq m_1 < \dots < m_k \leq d$.

::: {#fig-mv_ext}
{width=50%}
:::

## Multivariate elliptical dependencies {#sec-mv_ellip}

Let $\textbf{X} \sim \textbf{E}_d(\boldsymbol{\mu}, \Sigma, \psi)$ be an elliptically distributed random vector with cumulative distribution $F$ and margins $F_0, \dots, F_{d-1}$. Then the copula $C$ of $F$ is called an elliptical copula. We denote by $\phi$ the standard normal distribution function and by $\boldsymbol{\phi}_\Sigma$ the joint distribution function of $\textbf{X} \sim \mathcal{N}_d(\textbf{0}, \Sigma)$, where $\textbf{0}$ is the $d$-dimensional null vector. In the same way, we denote by $t_{\theta}$ the distribution function of a standard univariate $t$ distribution and by $\boldsymbol{t}_{\theta, \Sigma}$ the joint distribution function of the vector $\textbf{X} \sim \boldsymbol{t}_{d}(\theta, \textbf{0}, \Sigma)$. A $d \times d$ matrix $\Sigma$ is said to be positive semi-definite if for all $u \in \mathbb{R}^d$ we have

$$ u^\top \Sigma u \geq 0.$$

::: {#fig-mv_elli}
{width=50%}
:::
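As an illustration of this elliptical framework, a Gaussian-copula sample can be obtained with generic tools only: draw $\textbf{X} \sim \mathcal{N}_d(\textbf{0}, \Sigma)$ and apply $\phi$ componentwise, in the spirit of Algorithm 5.9 of @quantrisk (a minimal stand-alone sketch assuming $\Sigma$ is a correlation matrix, not COPPY's **Gaussian** class):

```python
import numpy as np
from scipy.stats import norm

def gaussian_copula_sample(Sigma, n_sample, seed=None):
    """Uniform-margin sample with Gaussian copula: transform a draw of
    X ~ N(0, Sigma) through the standard normal cdf, margin by margin.
    Sigma is assumed to be a (positive semi-definite) correlation matrix."""
    rng = np.random.default_rng(seed)
    d = Sigma.shape[0]
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n_sample)
    return norm.cdf(X)  # each column becomes uniform on (0, 1)
```

The copula of the output depends on $\Sigma$ only through its correlation structure, which is why the margins can be rescaled freely afterwards.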